Re: readchars, seek back, and readchars again
Hi Joe,

I was able to run the code you posted and reproduced the exact same result (Rakudo version 2020.02.1..1 built on MoarVM version 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit (e.g. UTF8-C8), but I didn't see any improvement.

Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:

https://github.com/rakudo/rakudo/issues/3461

I know it might be far-fetched, but what if your UTF-8 issue and Yary's UTF-16 issue were related? It would be nice to kill two birds with one stone.

Best Regards, Bill.

On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner wrote:
> Another version of my test code, checking .tell throughout:
Re: readchars, seek back, and readchars again
I was just posting that.

On 4/24/20, Elizabeth Mattijsen wrote:
>
>> On 24 Apr 2020, at 22:03, Joseph Brenner wrote:
>>
>> Thanks, yes I understand unicode and utf-8 reasonably well.
>>
>>> So Rakudo has to read the next codepoint to make sure that it isn't a
>>> combining codepoint.
>>
>>> It is probably faking up the reads to look right when reading ASCII, but
>>> failing to do that for wider codepoints.
>>
>> I think it'd be the other way around... the idea here would be it's
>> doing an extra readchar behind the scenes just in-case there's
>> combining chars involved-- so you're figuring there's some confusion
>> about the actual point in the file that's being read and the
>> abstraction that readchars is supplying?
>
> What does .tell say before and after the readchars?
Re: readchars, seek back, and readchars again
Another version of my test code, checking .tell throughout:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";   # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str = "ABCDEFGHI";

test_read_and_read_again($unichar_str, $file, 3);
test_read_and_read_again($ascii_str, $file, 0);

# write given string to file, then read the third character twice and check
sub test_read_and_read_again($str, $file, $nudge = 0) {
    spurt $file, $str;
    my $fh = $file.IO.open;
    printf "%d: just opened\n", $fh.tell;
    $fh.readchars(2);  # skip a few
    printf "%d: after skipping 2\n", $fh.tell;
    my $chr_1 = $fh.readchars(1);
    printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
    my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
    my $step_back = $width + $nudge;
    $fh.seek: -$step_back, SeekFromCurrent;
    printf "%d: after seeking back %d\n", $fh.tell, $step_back;
    my $chr_2 = $fh.readchars(1);
    printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
    is( $chr_1, $chr_2,
        "read, seek back, and read again gets same char with nudge of $nudge" );
}

The output looks like so:

/home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
0: just opened
9: after skipping 2
12: after reading 3rd: ䷼
6: after seeking back 6
12: after re-reading 3rd: ䷼
ok 1 - read, seek back, and read again gets same char with nudge of 3
0: just opened
2: after skipping 2
3: after reading 3rd: C
2: after seeking back 1
3: after re-reading 3rd: C
ok 2 - read, seek back, and read again gets same char with nudge of 0

It's really hard to see what I should do if I really wanted to
intermix readchars and seeks like this... I'd need to check the range
of the codepoint to see how far I need to seek to get where I expect
to be.
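One possible way around the relative-seek guesswork, offered only as a sketch: work out the byte offset from the characters already read and seek to it from the start of the file, instead of stepping back from wherever .tell says the handle is. The scratch file name and the variable names are illustrative. The sketch assumes the file holds NFC-normalized UTF-8, so that re-encoding the skipped text gives its length on disk, and it assumes that seek discards whatever the decoder has buffered, which the .tell output in this message suggests but which has not been checked against the Rakudo source.

```raku
use v6;

# Hypothetical scratch file; any writable path would do.
my $file = $*TMPDIR.add('scratch_seek_absolute.txt');
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $fh = $file.open;

# Read (and keep) the two characters in front of the one we care about.
my $skipped = $fh.readchars(2);

# Byte offset of the third character, derived from what was actually read
# rather than from .tell, which can include bytes the decoder read ahead.
my $offset = $skipped.encode('UTF-8').bytes;

my $chr_1 = $fh.readchars(1);

# Go back by absolute position instead of by a relative step.
$fh.seek: $offset, SeekFromBeginning;
my $chr_2 = $fh.readchars(1);
$fh.close;

printf "offset %d: first read %s, second read %s\n", $offset, $chr_1, $chr_2;
```

With this approach no $nudge is needed, because the target position never depends on how far ahead the decoder has read.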
Re: readchars, seek back, and readchars again
> On 24 Apr 2020, at 22:03, Joseph Brenner wrote:
>
> Thanks, yes I understand unicode and utf-8 reasonably well.
>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>
>> It is probably faking up the reads to look right when reading ASCII, but
>> failing to do that for wider codepoints.
>
> I think it'd be the other way around... the idea here would be it's
> doing an extra readchar behind the scenes just in-case there's
> combining chars involved-- so you're figuring there's some confusion
> about the actual point in the file that's being read and the
> abstraction that readchars is supplying?

What does .tell say before and after the readchars?
Re: readchars, seek back, and readchars again
Thanks, yes I understand unicode and utf-8 reasonably well.

> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.

> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.

I think it'd be the other way around... the idea here would be it's
doing an extra readchar behind the scenes just in case there are
combining chars involved -- so you're figuring there's some confusion
between the actual point in the file that's being read and the
abstraction that readchars is supplying?

On 4/24/20, Brad Gilbert wrote:
> In UTF8 characters can be 1 to 4 bytes long.
>
> UTF8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
Re: readchars, seek back, and readchars again
In UTF8 characters can be 1 to 4 bytes long.

UTF8 was designed so that 7-bit ASCII is a subset of it.

Any 8bit byte that has its most significant bit set cannot be ASCII.
So multi-byte codepoints have the most significant bit set for all of the
bytes.
The first byte can tell you the number of bytes that follow it.

That is how a single codepoint is stored.

A character can be made of several codepoints.

"\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
"é"

So Rakudo has to read the next codepoint to make sure that it isn't a
combining codepoint.

It is probably faking up the reads to look right when reading ASCII, but
failing to do that for wider codepoints.

On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner wrote:
> I thought that doing a readchars on a filehandle, seeking backwards
> the width of the char in bytes and then doing another read
> would always get the same character. That works for ascii-range
> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> characters (commonly 3-bytes in utf-8).
>
> The question then, is why do I need a $nudge of 3 for wide chars, but
> not ascii-range ones?
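The two layers described in this message can be sketched in a few lines of Raku. This is an illustration only: utf8-seq-length is a made-up helper name, not a built-in, and it does no validation beyond inspecting the high bits of the first byte. The second half shows that a base letter followed by a combining accent counts as one character to Raku, which is why a character reader has to look at what follows before it can return.

```raku
use v6;

# Length in bytes of a UTF-8 sequence, judged from its first byte alone.
# (Hypothetical helper; returns 0 for a continuation or invalid lead byte.)
sub utf8-seq-length(Int $lead-byte --> Int) {
    return 1 if ($lead-byte +& 0x80) == 0x00;   # 0xxxxxxx: ASCII
    return 2 if ($lead-byte +& 0xE0) == 0xC0;   # 110xxxxx
    return 3 if ($lead-byte +& 0xF0) == 0xE0;   # 1110xxxx
    return 4 if ($lead-byte +& 0xF8) == 0xF0;   # 11110xxx
    return 0;
}

my $wide = "\x[4DFC]";                  # one of the "wide" test characters
my $buf  = $wide.encode('UTF-8');
say $buf.bytes;                         # bytes used by the encoded character
say utf8-seq-length($buf[0]);           # the same count, from the first byte

say utf8-seq-length('A'.encode('UTF-8')[0]);   # an ASCII byte stands alone

# A base letter plus a combining accent is a single character (grapheme).
my $e-acute = "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]";
say $e-acute.chars;
```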
readchars, seek back, and readchars again
I thought that doing a readchars on a filehandle, seeking backwards
the width of the char in bytes and then doing another read
would always get the same character. That works for ascii-range
characters (1-byte in utf-8 encoding) but not multi-byte "wide"
characters (commonly 3-bytes in utf-8).

The question then, is why do I need a $nudge of 3 for wide chars, but
not ascii-range ones?

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";   # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str = "ABCDEFGHI";

subtest {
    my $nudge = 3;
    test_read_and_read_again($unichar_str, $file, $nudge);
}, "Wide unicode chars: $unichar_str";

subtest {
    my $nudge = 0;
    test_read_and_read_again($ascii_str, $file, $nudge);
}, "Ascii-range chars: $ascii_str";

# write given string to file, then read the third character twice and check
sub test_read_and_read_again($str, $file, $nudge = 0) {
    spurt $file, $str;
    my $fh = $file.IO.open;
    $fh.readchars(2);  # skip a few
    my $chr_1 = $fh.readchars(1);
    my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
    my $step_back = $width + $nudge;
    $fh.seek: -$step_back, SeekFromCurrent;
    my $chr_2 = $fh.readchars(1);
    is( $chr_1, $chr_2,
        "read, seek back, and read again gets same char with nudge of $nudge" );
}
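If exact byte positions matter more than character semantics, one alternative is to bypass the character decoder: open the file in binary mode, seek to a known byte offset, and decode only the bytes read. The sketch below is an assumption-laden illustration, not a recommended replacement for readchars. read-codepoint-at is a hypothetical helper, the scratch file name is made up, the input is assumed to be valid UTF-8, and combining sequences are deliberately ignored.

```raku
use v6;

# Hypothetical helper: return the single UTF-8 encoded codepoint that
# starts at byte $offset of a handle opened in binary mode.
sub read-codepoint-at(IO::Handle $fh, Int $offset --> Str) {
    $fh.seek: $offset, SeekFromBeginning;
    my $lead = $fh.read(1);
    my $byte = $lead[0];

    # Sequence length from the high bits of the first byte.
    my $len = ($byte +& 0x80) == 0x00 ?? 1
           !! ($byte +& 0xE0) == 0xC0 ?? 2
           !! ($byte +& 0xF0) == 0xE0 ?? 3
           !! 4;

    # Fetch the continuation bytes, if any, and decode the whole sequence.
    my $bytes = $len > 1 ?? $lead ~ $fh.read($len - 1) !! $lead;
    return $bytes.decode('utf8');
}

my $file = $*TMPDIR.add('scratch_seek_binary.txt');
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $fh = $file.open(:bin);
my $first  = read-codepoint-at($fh, 6);   # third character starts at byte 6
my $second = read-codepoint-at($fh, 6);   # seek back and read it again
$fh.close;

say $first eq $second;
```

In binary mode nothing is read ahead on the caller's behalf, so the position after each read is exactly the number of bytes requested. The cost is that the caller takes over the work readchars normally does, including any handling of combining characters.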