Re: readchars, seek back, and readchars again

2020-04-24 Thread William Michels via perl6-users
Hi Joe,

I was able to run the code you posted and reproduced the exact same
result (Rakudo version 2020.02.1..1 built on MoarVM version
2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit
(e.g. UTF8-C8), but I didn't see any improvement.

Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:

https://github.com/rakudo/rakudo/issues/3461

I know it might be far-fetched, but what if your UTF-8 issue and
Yary's UTF-16 issue were related? It would be nice to kill two birds
with one stone.

Best Regards, Bill.





Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
I was just posting that.

On 4/24/20, Elizabeth Mattijsen  wrote:
> What does .tell say before and after the readchars?
>


Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Another version of my test code, checking .tell throughout:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str =   "ABCDEFGHI";

test_read_and_read_again($unichar_str, $file, 3);
test_read_and_read_again($ascii_str,   $file, 0);

# write given string to file, then read the third character twice and check
sub test_read_and_read_again($str, $file, $nudge = 0) {
    spurt $file, $str;
    my $fh = $file.IO.open;
    printf "%d: just opened\n", $fh.tell;
    $fh.readchars(2);  # skip a few
    printf "%d: after skipping 2\n", $fh.tell;
    my $chr_1 =  $fh.readchars(1);
    printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
    my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
    my $step_back = $width + $nudge;
    $fh.seek: -$step_back, SeekFromCurrent;
    printf "%d: after seeking back %d\n", $fh.tell, $step_back;
    my $chr_2 =  $fh.readchars(1);
    printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
    is( $chr_1, $chr_2,
        "read, seek back, and read again gets same char with nudge of $nudge" );
}


The output looks like so:

/home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
0: just opened
9: after skipping 2
12: after reading 3rd: ䷼
6: after seeking back 6
12: after re-reading 3rd: ䷼
ok 1 - read, seek back, and read again gets same char with nudge of 3
0: just opened
2: after skipping 2
3: after reading 3rd: C
2: after seeking back 1
3: after re-reading 3rd: C
ok 2 - read, seek back, and read again gets same char with nudge of 0

It's really hard to see what I should do if I really wanted to
intermix readchars and seeks like this... I'd need to check the range
of the codepoint to see how far I need to seek to get where I expect
to be.
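
One way to sidestep the question, sketched here under assumptions rather than offered as the recommended Rakudo idiom, is to avoid mixing readchars with seek at all: open the handle with :bin, work out byte offsets from the codepoint ranges, and decode the bytes yourself. The helper name bytes-for-codepoint is hypothetical, and the sketch assumes every character is a single codepoint (no combining marks).

```raku
# Hypothetical helper: UTF-8 width in bytes of one codepoint, by range.
sub bytes-for-codepoint(Int $cp --> Int) {
    $cp < 0x80    ?? 1
 !! $cp < 0x800   ?? 2
 !! $cp < 0x10000 ?? 3
 !!                  4
}

my $str  = "\x[1200]\x[2D80]\x[4DFC]";
my $file = $*TMPDIR.add('scratch_file.txt');
$file.spurt($str);                      # spurt writes UTF-8 by default

# Byte offset of the third character = summed widths of the first two.
my $offset = $str.ords[^2].map(&bytes-for-codepoint).sum;
my $width  = bytes-for-codepoint($str.ords[2]);

# A :bin handle reads raw bytes, so no character decoder sits between
# seek and read.
my $fh = $file.open(:bin);
my @seen;
for 1..2 {
    $fh.seek($offset, SeekFromBeginning);
    @seen.push: $fh.read($width).decode('utf8');
}
$fh.close;

say @seen[0] eq @seen[1] ?? "same character both times" !! "mismatch";
```

For text that may contain combining marks, the width of a character would have to be the sum over all of its codepoints as they are stored in the file, which this sketch does not attempt.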





Re: readchars, seek back, and readchars again

2020-04-24 Thread Elizabeth Mattijsen



> On 24 Apr 2020, at 22:03, Joseph Brenner  wrote:
> 
> Thanks, yes I understand unicode and utf-8 reasonably well.
> 
>> So Rakudo has to read the next codepoint to make sure that it isn't a 
>> combining codepoint.
> 
>> It is probably faking up the reads to look right when reading ASCII, but 
>> failing to do that for wider codepoints.
> 
> I think it'd be the other way around... the idea here would be it's
> doing an extra readchar behind the scenes just in-case there's
> combining chars involved-- so you're figuring there's some confusion
> about the actual point in the file that's being read and the
> abstraction that readchars is supplying?

What does .tell say before and after the readchars?


Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Thanks, yes, I understand Unicode and UTF-8 reasonably well.

> So Rakudo has to read the next codepoint to make sure that it isn't a 
> combining codepoint.

> It is probably faking up the reads to look right when reading ASCII, but 
> failing to do that for wider codepoints.

I think it'd be the other way around... the idea here would be that
it's doing an extra readchar behind the scenes just in case there are
combining chars involved. So you're figuring there's some confusion
between the actual point in the file that's being read and the
abstraction that readchars is supplying?




Re: readchars, seek back, and readchars again

2020-04-24 Thread Brad Gilbert
In UTF-8, a codepoint takes 1 to 4 bytes.

UTF-8 was designed so that 7-bit ASCII is a subset of it.

Any 8-bit byte that has its most significant bit set cannot be ASCII.
So multi-byte codepoints have the most significant bit set in all of
their bytes.
The first byte tells you how many bytes follow it.

That is how a single codepoint is stored.
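
As a rough illustration of the byte layout described above (a sketch, not code taken from Rakudo; the sub name seq-length is made up for this example), the length of a UTF-8 sequence can be read off the high bits of its first byte:

```raku
# Number of bytes in a UTF-8 sequence, judged from its first byte.
# Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
sub seq-length(Int $lead --> Int) {
    return 1 if ($lead +& 0x80) == 0x00;   # 0xxxxxxx: plain ASCII
    return 2 if ($lead +& 0xE0) == 0xC0;   # 110xxxxx
    return 3 if ($lead +& 0xF0) == 0xE0;   # 1110xxxx
    return 4 if ($lead +& 0xF8) == 0xF0;   # 11110xxx
    return 0;
}

for "A", "\x[E9]", "\x[4DFC]", "\x[1F600]" -> $c {
    my $bytes = $c.encode('utf8');
    printf "U+%04X: %d byte(s), lead byte 0x%02X says %d\n",
        $c.ord, $bytes.elems, $bytes[0], seq-length($bytes[0]);
}
```

For each of the four sample characters, the count taken from the lead byte should agree with the actual number of encoded bytes.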

A character can be made of several codepoints.

"\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
"é"

So Rakudo has to read the next codepoint to make sure that it isn't a
combining codepoint.

It is probably faking up the reads to look right when reading ASCII, but
failing to do that for wider codepoints.
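
To make the point about characters versus codepoints concrete, here is a small sketch using only standard Str and Uni methods. It says nothing about how Rakudo's decoder is implemented; it only shows that one character can span more than one codepoint, which is why a reader cannot know a character is complete until it has seen what follows.

```raku
my $e-acute = "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]";
say $e-acute.chars;                  # 1: a single character (grapheme)
say $e-acute.NFD.codes;              # 2: two codepoints when decomposed
say $e-acute.encode('utf8').bytes;   # 2: encoded in composed form, U+00E9

# "x" plus a combining acute has no precomposed form, so it stays
# two codepoints even after normalization.
my $x-acute = "x\c[COMBINING ACUTE ACCENT]";
say $x-acute.chars;                  # 1
say $x-acute.codes;                  # 2
say $x-acute.encode('utf8').bytes;   # 3: one byte for x, two for U+0301
```

Note that the first string is written out as 2 bytes rather than 3, because the string is encoded in its composed (NFC) form.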



readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
I thought that doing a readchars on a filehandle, seeking backwards
by the width of the char in bytes, and then doing another read
would always get the same character.  That works for ASCII-range
characters (1 byte in UTF-8 encoding) but not for multi-byte "wide"
characters (commonly 3 bytes in UTF-8).

The question, then, is why do I need a $nudge of 3 for wide chars, but
not for ASCII-range ones?

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str =   "ABCDEFGHI";

subtest {
    my $nudge = 3;
    test_read_and_read_again($unichar_str, $file, $nudge);
}, "Wide unicode chars: $unichar_str";

subtest {
    my $nudge = 0;
    test_read_and_read_again($ascii_str, $file, $nudge);
}, "Ascii-range chars: $ascii_str";

# write given string to file, then read the third character twice and check
sub test_read_and_read_again($str, $file, $nudge = 0) {
    spurt $file, $str;
    my $fh = $file.IO.open;
    $fh.readchars(2);  # skip a few
    my $chr_1 =  $fh.readchars(1);
    my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
    my $step_back = $width + $nudge;
    $fh.seek: -$step_back, SeekFromCurrent;
    my $chr_2 =  $fh.readchars(1);
    is( $chr_1, $chr_2,
        "read, seek back, and read again gets same char with nudge of $nudge" );
}