Re: Using slurp to read in a utf16 file

2020-04-26 Thread Joseph Brenner
To expand on the point a bit, doing exactly the same spurt/slurp works
with "utf8", but doing it with "utf16" fails to read the text back in:

{
my $unichar_str =  # ሀⶀ䷼ꪪⲤⲎ
   "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $file = "/tmp/stuff_in_utf8.txt";
my $fh = $file.IO.open( :w, :enc("utf8") );
spurt $fh, $unichar_str;

my $contents = slurp( $file, :enc("utf8") );
my $huh = $contents.gist;
say "contents: $contents";
say "length: ", $contents.chars;
}

{
my $unichar_str =  # ሀⶀ䷼ꪪⲤⲎ
   "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $file = "/tmp/stuff_in_utf16.txt";
my $fh = $file.IO.open( :w, :enc("utf16") );
spurt $fh, $unichar_str;

my $contents = slurp( $file, :enc("utf16") );
my $huh = $contents.gist;
say "contents: $contents";        # contents:
say "length: ", $contents.chars;  # 0
}


The output:
   contents: ሀⶀ䷼ꪪⲤⲎ
   length: 6
   contents:
   length: 0

The file definitely has something in it, though:

wc /tmp/stuff_in_utf16.txt
  0  1 14 /tmp/stuff_in_utf16.txt
cat /tmp/stuff_in_utf16.txt
 \377\376^@^R\200-\374M\252\252\244,\216,
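As an independent cross-check outside Raku, the 14 bytes shown above can be decoded by hand. The sketch below is Python rather than Raku, and it assumes the octal and control-character escapes in the cat output map to the byte values listed in the comment (for example ^@ = 0x00 and ^R = 0x12). It suggests the file is well-formed little-endian UTF-16 with a leading byte-order mark, so the write side appears to have worked and the problem is on the read side.

```python
# Bytes transcribed from the cat output above (an assumption about how the
# terminal rendered them):
#   \377\376  ^@ ^R  \200 -  \374 M  \252\252  \244 ,  \216 ,
raw = bytes([0xFF, 0xFE,   # byte-order mark: little-endian UTF-16
             0x00, 0x12,   # U+1200
             0x80, 0x2D,   # U+2D80
             0xFC, 0x4D,   # U+4DFC
             0xAA, 0xAA,   # U+AAAA
             0xA4, 0x2C,   # U+2CA4
             0x8E, 0x2C])  # U+2C8E

# The "utf-16" codec reads the BOM to pick the byte order, then drops it.
text = raw.decode("utf-16")
codepoints = [f"U+{ord(c):04X}" for c in text]

print(len(raw))     # byte count, matching the 14 reported by wc
print(len(text))    # number of codepoints after decoding
print(codepoints)
```

The decoded codepoints match the six escapes in $unichar_str, and the byte count (a 2-byte BOM plus six 2-byte code units) matches the 14 reported by wc.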



On 4/26/20, Joseph Brenner wrote:
> Looking at the documentation for slurp, it looks as though there's a
> convenient "enc" option you can use if you're not reading utf8 files.
> So I thought this would work:
>
>    my $contents = slurp $file, enc => "utf16";
>
> It's not doing what I expected... Raku acts like there's nothing in
> $contents.
>
> Here's the test code I've been using:
>
> # ሀⶀ䷼ꪪⲤⲎ
> my $unichar_str =
>  "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
>
> my $file = "/home/doom/tmp/stuff_in_utf16.txt";
> my $fh = $file.IO.open( :w, :enc("utf16") );
> spurt $fh, $unichar_str;
>
> # read entire file as utf16 Str
> my $contents = slurp $file, enc => "utf16";
> my $huh = $contents.gist;
> say "contents: $contents";  #  contents:
> say $contents.elems;  # 1
>


Using slurp to read in a utf16 file

2020-04-26 Thread Joseph Brenner
Looking at the documentation for slurp, it looks as though there's a
convenient "enc" option you can use if you're not reading utf8 files.
So I thought this would work:

   my $contents = slurp $file, enc => "utf16";

It's not doing what I expected... Raku acts like there's nothing in $contents.

Here's the test code I've been using:

# ሀⶀ䷼ꪪⲤⲎ
my $unichar_str =
 "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $file = "/home/doom/tmp/stuff_in_utf16.txt";
my $fh = $file.IO.open( :w, :enc("utf16") );
spurt $fh, $unichar_str;

# read entire file as utf16 Str
my $contents = slurp $file, enc => "utf16";
my $huh = $contents.gist;
say "contents: $contents";  #  contents:
say $contents.elems;  # 1


Re: readchars, seek back, and readchars again

2020-04-26 Thread Joseph Brenner
I decided to open an issue for this one.  Even if there's no practical
fix for the behavior of readchars, I'd think this odd meaning of the
"current" point in the file would need to be better documented:

  https://github.com/rakudo/rakudo/issues/3646

I simplified the test I've been using:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;

# ሀⶀ䷼ꪪⲤⲎ
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
my $unichar_file = "$tmpdir/six_unicode_chars.txt";
spurt $unichar_file, $unichar_str;

my $ascii_str =   "ABCDEFGHI";
my $ascii_file = "$tmpdir/nine_ascii_chars.txt";
spurt $ascii_file, $ascii_str;


{
    my $fh         = $ascii_file.IO.open;
    my $loc1       = $fh.tell;
    my $char_count = 3;
    my $str        = readchars_no_advance($fh, $char_count);
    my $loc2       = $fh.tell;
    is( $loc1, $loc2,
        "Testing that readchars file position works as expected for ascii-range chars" );
}


{
    my $fh         = $unichar_file.IO.open;
    my $loc1       = $fh.tell;
    my $char_count = 3;
    my $str        = readchars_no_advance($fh, $char_count);
    my $loc2       = $fh.tell;
    is( $loc1, $loc2,
        "Testing that readchars file position works as expected for unichars beyond ascii-range" );
}

# After a readchars call, this tries to return to the original position in the file
sub readchars_no_advance ($fh, $char_count) {
    my $str   = $fh.readchars($char_count);
    my $width = $str.encode('UTF-8').bytes;
    $fh.seek: -$width, SeekFromCurrent;
    return $str;
}




On 4/24/20, Brad Gilbert wrote:
> In UTF8 characters can be 1 to 4 bytes long.
>
> UTF8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
>
> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner wrote:
>
>> I thought that doing a readchars on a filehandle, seeking backwards
>> the width of the char in bytes and then doing another read
>> would always get the same character.  That works for ascii-range
>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> characters (commonly 3-bytes in utf-8).
>>
>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> not ascii-range ones?
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>> my $ascii_str =   "ABCDEFGHI";
>>
>> subtest {
>> my $nudge = 3;
>> test_read_and_read_again($unichar_str, $file, $nudge);
>> }, "Wide unicode chars: $unichar_str";
>>
>> subtest {
>> my $nudge = 0;
>> test_read_and_read_again($ascii_str, $file, $nudge);
>> }, "Ascii-range chars: $ascii_str";
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>     spurt $file, $str;
>>     my $fh = $file.IO.open;
>>     $fh.readchars(2);  # skip a few
>>     my $chr_1 = $fh.readchars(1);
>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>     my $step_back = $width + $nudge;
>>     $fh.seek: -$step_back, SeekFromCurrent;
>>     my $chr_2 = $fh.readchars(1);
>>     is( $chr_1, $chr_2,
>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
>>
>
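
The points quoted above about UTF-8 byte widths and combining codepoints can be checked outside Raku. The following is a minimal Python sketch, not Rakudo's implementation; the helper name utf8_widths is made up for illustration. One caveat: Python's len counts codepoints, while Raku's .chars counts graphemes, so the last two lines only illustrate that two codepoints can stand for one visible character.

```python
import unicodedata

ascii_str = "ABCDEFGHI"
wide_str = "\u1200\u2D80\u4DFC\uAAAA\u2CA4\u2C8E"  # the six test characters

def utf8_widths(s):
    """Return the UTF-8 encoded length in bytes of each codepoint in s."""
    return [len(ch.encode("utf-8")) for ch in s]

print(utf8_widths(ascii_str))  # ASCII-range: one byte each
print(utf8_widths(wide_str))   # U+0800..U+FFFF: three bytes each

# A base letter followed by a combining mark is two codepoints,
# but NFC normalization composes them into a single codepoint (U+00E9).
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

Because a combining mark may follow any base character, a reader that returns whole characters has to look at the next codepoint before it can return the current one, which is consistent with the file position sitting further along than the byte width of the characters returned.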