Re: readchars, seek back, and readchars again

2020-04-28 Thread Samantha McVey
On Saturday 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> > strings:
> >
> >  https://github.com/rakudo/rakudo/issues/3461
> >
> > I know it might be far-fetched, but what if your UTF-8 issue and
> > Yary's UTF-16 issue were related
>
> Well, an issue with handling combining characters could easily affect
> both; nothing about it is specific to one encoding. Yary's issue
> doesn't have to do with reading from disk though; he's just looking at
> the raw bytes the encoding generates.

Re: readchars, seek back, and readchars again

2020-04-28 Thread Samantha McVey
On Monday 27 April 2020 09:49:20 CEST Joseph Brenner wrote:
> After you do a .readchars, what point in the file would you expect to
> be "current"?  I would expect it would be the point right after the
> last char read.  Instead that's true if you're reading ascii
> characters but not unicode characters up above the ascii range, in
> which case the "current" point is larger than that (in the cases I've
> looked at, larger by 3 bytes).
> 
> If you try to intermix readchars with calls to .seek using the
> "SeekFromCurrent" feature, it can be tricky to predict where you're
> going to end up, because the point you're starting at depends on what
> kind of text you've been reading, not just the number of bytes you've
> read.
> 
> Is that making any sense?  I posted a later code example that might
> show the problem more clearly...

Re: readchars, seek back, and readchars again

2020-04-27 Thread Joseph Brenner
After you do a .readchars, what point in the file would you expect to
be "current"?  I would expect it would be the point right after the
last char read.  Instead that's true if you're reading ascii
characters but not unicode characters up above the ascii range, in
which case the "current" point is larger than that (in the cases I've
looked at, larger by 3 bytes).

If you try to intermix readchars with calls to .seek using the
"SeekFromCurrent" feature, it can be tricky to predict where you're
going to end up, because the point you're starting at depends on what
kind of text you've been reading, not just the number of bytes you've
read.

Is that making any sense?  I posted a later code example that might
show the problem more clearly...
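
For reference, the byte offset one would expect after reading N characters of a UTF-8 file can be computed from the string itself. This is only a minimal sketch: it assumes the file holds exactly the bytes Raku would produce when re-encoding the text, and the variable names are illustrative.

```raku
# Expected byte offset after reading N characters of UTF-8 text.
my $str = "\x[1200]\x[2D80]\x[4DFC]";   # three codepoints, 3 bytes each in UTF-8

my @expected = (1..$str.chars).map: -> $n {
    $str.substr(0, $n).encode('utf8').bytes
};
say @expected;   # [3 6 9]
```

Comparing these numbers with what .tell reports (9 rather than 6 after two characters, in the output posted in this thread) shows the 3-byte overshoot described above.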

On 4/26/20, Samantha McVey  wrote:
> On Saturday 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
>> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
>> > strings:
>> >
>> >  https://github.com/rakudo/rakudo/issues/3461
>> >
>> > I know it might be far-fetched, but what if your UTF-8 issue and
>> > Yary's UTF-16 issue were related
>>
>> Well, an issue with handling combining characters could easily affect
>> both; nothing about it is specific to one encoding. Yary's issue
>> doesn't have to do with reading from disk though; he's just looking at
>> the raw bytes the encoding generates.

Re: readchars, seek back, and readchars again

2020-04-26 Thread Joseph Brenner
I decided to open an issue for this one.  Even if there's no practical
fix for the behavior of readchars, I'd think this odd meaning of the
"current" point in the file would need to be better documented:

  https://github.com/rakudo/rakudo/issues/3646

I simplified the test I've been using:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;

# ሀⶀ䷼ꪪⲤⲎ
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
my $unichar_file = "$tmpdir/six_unicode_chars.txt";
spurt $unichar_file, $unichar_str;

my $ascii_str =   "ABCDEFGHI";
my $ascii_file = "$tmpdir/nine_ascii_chars.txt";
spurt $ascii_file, $ascii_str;


{
    my $fh = $ascii_file.IO.open;
    my $loc1 = $fh.tell;
    my $char_count = 3;
    my $str = readchars_no_advance($fh, $char_count);
    my $loc2 = $fh.tell;
    is( $loc1, $loc2,
        "Testing that readchars file position works as expected for ascii-range chars" );
}


{
    my $fh = $unichar_file.IO.open;
    my $loc1 = $fh.tell;
    my $char_count = 3;
    my $str = readchars_no_advance($fh, $char_count);
    my $loc2 = $fh.tell;
    is( $loc1, $loc2,
        "Testing that readchars file position works as expected for unichars beyond ascii-range" );
}

# After a readchar, this tries to return to the original position in the file
sub readchars_no_advance ($fh, $char_count) {
    my $str   = $fh.readchars($char_count);
    my $width = $str.encode('UTF-8').bytes;
    $fh.seek: -$width, SeekFromCurrent;
    return $str;
}




On 4/24/20, Brad Gilbert  wrote:
> In UTF-8, characters can be 1 to 4 bytes long.
>
> UTF-8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8-bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
>
> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner  wrote:
>
>> I thought that doing a readchars on a filehandle, seeking backwards
>> the width of the char in bytes and then doing another read
>> would always get the same character.  That works for ascii-range
>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> characters (commonly 3-bytes in utf-8).
>>
>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> not ascii-range ones?
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> # ሀⶀ䷼ꪪⲤⲎ
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
>> my $ascii_str =   "ABCDEFGHI";
>>
>> subtest {
>>     my $nudge = 3;
>>     test_read_and_read_again($unichar_str, $file, $nudge);
>> }, "Wide unicode chars: $unichar_str";
>>
>> subtest {
>>     my $nudge = 0;
>>     test_read_and_read_again($ascii_str, $file, $nudge);
>> }, "Ascii-range chars: $ascii_str";
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>     spurt $file, $str;
>>     my $fh = $file.IO.open;
>>     $fh.readchars(2);  # skip a few
>>     my $chr_1 =  $fh.readchars(1);
>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>     my $step_back = $width + $nudge;
>>     $fh.seek: -$step_back, SeekFromCurrent;
>>     my $chr_2 =  $fh.readchars(1);
>>     is( $chr_1, $chr_2,
>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
>>
>
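
Brad's point that the first byte tells you how long the sequence is can be made concrete. A small sketch follows; the helper name utf8-seq-length is made up for illustration and is not a Rakudo built-in.

```raku
# Length of a UTF-8 sequence, judged from its first byte alone.
sub utf8-seq-length(Int $lead) {
    return 1 if $lead +& 0x80 == 0x00;   # 0xxxxxxx: plain ASCII
    return 2 if $lead +& 0xE0 == 0xC0;   # 110xxxxx
    return 3 if $lead +& 0xF0 == 0xE0;   # 1110xxxx
    return 4 if $lead +& 0xF8 == 0xF0;   # 11110xxx
    return 0;                            # continuation byte or invalid lead byte
}

say utf8-seq-length("C".encode('utf8')[0]);         # 1
say utf8-seq-length("\x[4DFC]".encode('utf8')[0]);  # 3
```

This only covers how a single codepoint is stored. It says nothing about whether the next codepoint is a combining one, which is the part that forces a decoder to look ahead.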


Re: readchars, seek back, and readchars again

2020-04-25 Thread Joseph Brenner
> Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:
>
>  https://github.com/rakudo/rakudo/issues/3461
>
> I know it might be far-fetched, but what if your UTF-8 issue and
> Yary's UTF-16 issue were related

Well, an issue with handling combining characters could easily affect
both; nothing about it is specific to one encoding. Yary's issue
doesn't have to do with reading from disk though; he's just looking at
the raw bytes the encoding generates.
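
To illustrate why combining characters are independent of the encoding, here is a minimal sketch. The sample string is an arbitrary choice, picked because x with an acute accent has no precomposed form.

```raku
my $g = "x\x[301]";              # LATIN SMALL LETTER X + COMBINING ACUTE ACCENT
say $g.chars;                    # 1 (one grapheme)
say $g.codes;                    # 2 (two codepoints)
say $g.encode('utf8').bytes;     # 3 (1 byte + 2 bytes)
say $g.encode('utf16').elems;    # number of 16-bit code units
```

Whichever encoding is used, the one character spans more than one codepoint, so any reader that hands back whole characters has to look past the first codepoint before it can return.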



On 4/24/20, William Michels  wrote:
> Hi Joe,
>
> I was able to run the code you posted and reproduced the exact same
> result (Rakudo version 2020.02.1..1 built on MoarVM version
> 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit
> (e.g. UTF8-C8), but I didn't see any improvement.
>
> Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> strings:
>
> https://github.com/rakudo/rakudo/issues/3461
>
> I know it might be far-fetched, but what if your UTF-8 issue and
> Yary's UTF-16 issue were related? It would be nice to kill two birds
> with one stone.
>
> Best Regards, Bill.

Re: readchars, seek back, and readchars again

2020-04-24 Thread William Michels via perl6-users
Hi Joe,

I was able to run the code you posted and reproduced the exact same
result (Rakudo version 2020.02.1..1 built on MoarVM version
2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit
(e.g. UTF8-C8), but I didn't see any improvement.

Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:

https://github.com/rakudo/rakudo/issues/3461

I know it might be far-fetched, but what if your UTF-8 issue and
Yary's UTF-16 issue were related? It would be nice to kill two birds
with one stone.

Best Regards, Bill.




On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner  wrote:
>
> Another version of my test code, checking .tell throughout:
>
> use v6;
> use Test;
>
> my $tmpdir = IO::Spec::Unix.tmpdir;
> my $file = "$tmpdir/scratch_file.txt";
> # ሀⶀ䷼ꪪⲤⲎ
> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
> my $ascii_str =   "ABCDEFGHI";
>
> test_read_and_read_again($unichar_str, $file, 3);
> test_read_and_read_again($ascii_str,   $file, 0);
>
> # write given string to file, then read the third character twice and check
> sub test_read_and_read_again($str, $file, $nudge = 0) {
>     spurt $file, $str;
>     my $fh = $file.IO.open;
>     printf "%d: just opened\n", $fh.tell;
>     $fh.readchars(2);  # skip a few
>     printf "%d: after skipping 2\n", $fh.tell;
>     my $chr_1 =  $fh.readchars(1);
>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>     my $step_back = $width + $nudge;
>     $fh.seek: -$step_back, SeekFromCurrent;
>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
>     my $chr_2 =  $fh.readchars(1);
>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
>     is( $chr_1, $chr_2,
>         "read, seek back, and read again gets same char with nudge of $nudge" );
> }
>
>
> The output looks like so:
>
> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> 0: just opened
> 9: after skipping 2
> 12: after reading 3rd: ䷼
> 6: after seeking back 6
> 12: after re-reading 3rd: ䷼
> ok 1 - read, seek back, and read again gets same char with nudge of 3
> 0: just opened
> 2: after skipping 2
> 3: after reading 3rd: C
> 2: after seeking back 1
> 3: after re-reading 3rd: C
> ok 2 - read, seek back, and read again gets same char with nudge of 0
>
> It's really hard to see what I should do if I really wanted to
> intermix readchars and seeks like this... I'd need to check the range
> of the codepoint to see how far I need to seek to get where I expect
> to be.

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
I was just posting that.

On 4/24/20, Elizabeth Mattijsen  wrote:
>
>
>> On 24 Apr 2020, at 22:03, Joseph Brenner  wrote:
>>
>> Thanks, yes I understand unicode and utf-8 reasonably well.
>>
>>> So Rakudo has to read the next codepoint to make sure that it isn't a
>>> combining codepoint.
>>
>>> It is probably faking up the reads to look right when reading ASCII, but
>>> failing to do that for wider codepoints.
>>
>> I think it'd be the other way around... the idea here would be it's
>> doing an extra readchar behind the scenes just in case there's
>> combining chars involved-- so you're figuring there's some confusion
>> about the actual point in the file that's being read and the
>> abstraction that readchars is supplying?
>
> What does .tell say before and after the readchars?
>


Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Another version of my test code, checking .tell throughout:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str =   "ABCDEFGHI";

test_read_and_read_again($unichar_str, $file, 3);
test_read_and_read_again($ascii_str,   $file, 0);

# write given string to file, then read the third character twice and check
sub test_read_and_read_again($str, $file, $nudge = 0) {
    spurt $file, $str;
    my $fh = $file.IO.open;
    printf "%d: just opened\n", $fh.tell;
    $fh.readchars(2);  # skip a few
    printf "%d: after skipping 2\n", $fh.tell;
    my $chr_1 =  $fh.readchars(1);
    printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
    my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
    my $step_back = $width + $nudge;
    $fh.seek: -$step_back, SeekFromCurrent;
    printf "%d: after seeking back %d\n", $fh.tell, $step_back;
    my $chr_2 =  $fh.readchars(1);
    printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
    is( $chr_1, $chr_2,
        "read, seek back, and read again gets same char with nudge of $nudge" );
}


The output looks like so:

/home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
0: just opened
9: after skipping 2
12: after reading 3rd: ䷼
6: after seeking back 6
12: after re-reading 3rd: ䷼
ok 1 - read, seek back, and read again gets same char with nudge of 3
0: just opened
2: after skipping 2
3: after reading 3rd: C
2: after seeking back 1
3: after re-reading 3rd: C
ok 2 - read, seek back, and read again gets same char with nudge of 0

It's really hard to see what I should do if I really wanted to
intermix readchars and seeks like this... I'd need to check the range
of the codepoint to see how far I need to seek to get where I expect
to be.
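
One possible way around checking codepoint ranges is to keep your own byte count and seek only to absolute positions, never relative to the "current" one. The sketch below is not a confirmed fix. The helper readchars-at and the scratch file name are hypothetical. It assumes the file is UTF-8, that re-encoding the string gives the same bytes as stored in the file, and that an absolute seek discards whatever the decoder had buffered (the output above, where the re-read after seeking returns the right character, suggests it does).

```raku
# Read $count characters starting at an absolute byte offset, and return
# the string together with the byte offset just past it.
# Assumption: re-encoding the string yields the same bytes as in the file.
sub readchars-at($fh, Int $offset, Int $count) {
    $fh.seek: $offset, SeekFromBeginning;
    my $str = $fh.readchars($count);
    return $str, $offset + $str.encode('utf8').bytes;
}

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file   = "$tmpdir/readchars_at_sketch.txt";   # hypothetical scratch file
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[2CA4]";

my $fh = $file.IO.open;
my ($skipped, $pos) = readchars-at($fh, 0, 2);      # skip two characters
my ($chr_1)         = readchars-at($fh, $pos, 1);   # read the third
my ($chr_2)         = readchars-at($fh, $pos, 1);   # read it again
say $chr_1 eq $chr_2;
$fh.close;
```

Because the offset is computed from the characters actually returned, it does not depend on what .tell reports after a readchars.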



On 4/24/20, Joseph Brenner  wrote:
> Thanks, yes I understand unicode and utf-8 reasonably well.
>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>
>> It is probably faking up the reads to look right when reading ASCII, but
>> failing to do that for wider codepoints.
>
> I think it'd be the other way around... the idea here would be it's
> doing an extra readchar behind the scenes just in case there's
> combining chars involved-- so you're figuring there's some confusion
> about the actual point in the file that's being read and the
> abstraction that readchars is supplying?
>
>
> On 4/24/20, Brad Gilbert  wrote:
>> In UTF8 characters can be 1 to 4 bytes long.
>>
>> UTF8 was designed so that 7-bit ASCII is a subset of it.
>>
>> Any 8bit byte that has its most significant bit set cannot be ASCII.
>> So multi-byte codepoints have the most significant bit set for all of the
>> bytes.
>> The first byte can tell you the number of bytes that follow it.
>>
>> That is how a single codepoint is stored.
>>
>> A character can be made of several codepoints.
>>
>> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>> "é"
>>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>>
>> It is probably faking up the reads to look right when reading ASCII, but
>> failing to do that for wider codepoints.
>>
>> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner  wrote:
>>
>>> I thought that doing a readchars on a filehandle, seeking backwards
>>> the width of the char in bytes and then doing another read
>>> would always get the same character.  That works for ascii-range
>>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>>> characters (commonly 3-bytes in utf-8).
>>>
>>> The question then, is why do I need a $nudge of 3 for wide chars, but
>>> not ascii-range ones?
>>>
>>> use v6;
>>> use Test;
>>>
>>> my $tmpdir = IO::Spec::Unix.tmpdir;
>>> my $file = "$tmpdir/scratch_file.txt";
>>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>>> my $ascii_str =   "ABCDEFGHI";
>>>
>>> subtest {
>>> my $nudge = 3;
>>> test_read_and_read_again($unichar_str, $file, $nudge);
>>> }, "Wide unicode chars: $unichar_str";
>>>
>>> subtest {
>>> my $nudge = 0;
>>> test_read_and_read_again($ascii_str, $file, $nudge);
>>> }, "Ascii-range chars: $ascii_str";
>>>
>>> # write given string to file, then read the third character twice and check
>>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>> spurt $file, $str;
>>> my $fh = $file.IO.open;
>>> $fh.readchars(2);  # skip a few
>>> my $chr_1 =  $fh.readchars(1);
>>> my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>> my $step_back = $width + $nudge;
>>> $fh.seek: -$step_back, SeekFromCurrent;
>>> my $chr_2 =  $fh.readchars(1);
>>> is( $chr_1, $chr_2,
>>> "read, seek back, and read again gets same char with nudge of $nudge" );
>>> }
>>>
>>
>


Re: readchars, seek back, and readchars again

2020-04-24 Thread Elizabeth Mattijsen



> On 24 Apr 2020, at 22:03, Joseph Brenner  wrote:
> 
> Thanks, yes I understand unicode and utf-8 reasonably well.
> 
>> So Rakudo has to read the next codepoint to make sure that it isn't a 
>> combining codepoint.
> 
>> It is probably faking up the reads to look right when reading ASCII, but 
>> failing to do that for wider codepoints.
> 
> I think it'd be the other way around... the idea here would be it's
> doing an extra readchar behind the scenes just in case there's
> combining chars involved-- so you're figuring there's some confusion
> about the actual point in the file that's being read and the
> abstraction that readchars is supplying?

What does .tell say before and after the readchars?


Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Thanks, yes I understand unicode and utf-8 reasonably well.

> So Rakudo has to read the next codepoint to make sure that it isn't a 
> combining codepoint.

> It is probably faking up the reads to look right when reading ASCII, but 
> failing to do that for wider codepoints.

I think it'd be the other way around... the idea here would be it's
doing an extra readchar behind the scenes just in case there's
combining chars involved-- so you're figuring there's some confusion
about the actual point in the file that's being read and the
abstraction that readchars is supplying?


On 4/24/20, Brad Gilbert  wrote:
> In UTF8 characters can be 1 to 4 bytes long.
>
> UTF8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
>
> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner  wrote:
>
>> I thought that doing a readchars on a filehandle, seeking backwards
>> the width of the char in bytes and then doing another read
>> would always get the same character.  That works for ascii-range
>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> characters (commonly 3-bytes in utf-8).
>>
>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> not ascii-range ones?
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>> my $ascii_str =   "ABCDEFGHI";
>>
>> subtest {
>> my $nudge = 3;
>> test_read_and_read_again($unichar_str, $file, $nudge);
>> }, "Wide unicode chars: $unichar_str";
>>
>> subtest {
>> my $nudge = 0;
>> test_read_and_read_again($ascii_str, $file, $nudge);
>> }, "Ascii-range chars: $ascii_str";
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>> spurt $file, $str;
>> my $fh = $file.IO.open;
>> $fh.readchars(2);  # skip a few
>> my $chr_1 =  $fh.readchars(1);
>> my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>> my $step_back = $width + $nudge;
>> $fh.seek: -$step_back, SeekFromCurrent;
>> my $chr_2 =  $fh.readchars(1);
>> is( $chr_1, $chr_2,
>> "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
>>
>


Re: readchars, seek back, and readchars again

2020-04-24 Thread Brad Gilbert
In UTF8 characters can be 1 to 4 bytes long.

UTF8 was designed so that 7-bit ASCII is a subset of it.

Any 8bit byte that has its most significant bit set cannot be ASCII.
So multi-byte codepoints have the most significant bit set for all of the
bytes.
The first byte can tell you the number of bytes that follow it.

That is how a single codepoint is stored.

A character can be made of several codepoints.

"\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
"é"

So Rakudo has to read the next codepoint to make sure that it isn't a
combining codepoint.

It is probably faking up the reads to look right when reading ASCII, but
failing to do that for wider codepoints.
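
The two points above (variable byte width per codepoint, and one character made of several codepoints) can be illustrated with a short sketch. The sample characters and variable names are chosen here for illustration and are not from the original test; the code only uses Str.encode, Blob.bytes, Str.chars and Str.NFD.

```raku
# UTF-8 byte widths of single codepoints, from ASCII up to the 4-byte range.
my @samples = "A", "\x[E9]", "\x[4DFC]", "\x[1F600]";
my @widths  = @samples.map: *.encode('UTF-8').bytes;
for @samples Z @widths -> ($char, $bytes) {
    say "U+{$char.ord.base(16)}: $bytes byte(s)";
}

# A single character (grapheme) written as two codepoints.
my $e-acute = "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]";
say "graphemes: ",         $e-acute.chars;      # what character-level reads count
say "codepoints in NFD: ", $e-acute.NFD.codes;  # base letter plus combining mark
```

Because a combining mark may follow any base codepoint, a character-level reader cannot know a character is complete until it has looked at the codepoint after it, which is the read-ahead described above.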

On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner  wrote:

> I thought that doing a readchars on a filehandle, seeking backwards
> the width of the char in bytes and then doing another read
> would always get the same character.  That works for ascii-range
> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> characters (commonly 3-bytes in utf-8).
>
> The question then, is why do I need a $nudge of 3 for wide chars, but
> not ascii-range ones?
>
> use v6;
> use Test;
>
> my $tmpdir = IO::Spec::Unix.tmpdir;
> my $file = "$tmpdir/scratch_file.txt";
> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
> my $ascii_str =   "ABCDEFGHI";
>
> subtest {
> my $nudge = 3;
> test_read_and_read_again($unichar_str, $file, $nudge);
> }, "Wide unicode chars: $unichar_str";
>
> subtest {
> my $nudge = 0;
> test_read_and_read_again($ascii_str, $file, $nudge);
> }, "Ascii-range chars: $ascii_str";
>
> # write given string to file, then read the third character twice and check
> sub test_read_and_read_again($str, $file, $nudge = 0) {
> spurt $file, $str;
> my $fh = $file.IO.open;
> $fh.readchars(2);  # skip a few
> my $chr_1 =  $fh.readchars(1);
> my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> my $step_back = $width + $nudge;
> $fh.seek: -$step_back, SeekFromCurrent;
> my $chr_2 =  $fh.readchars(1);
> is( $chr_1, $chr_2,
> "read, seek back, and read again gets same char with nudge of $nudge" );
> }
>