Re: "ICU - International Components for Unicode"

2020-09-27 Thread Samantha McVey
MoarVM uses its own database built from the UCD (Unicode Character
Database). One advantage is that this is probably faster than calling into
ICU to look up information for each codepoint in a long string. Second,
MoarVM implements its own text data structures, so ICU's nice features for
working with text would be difficult to use.

In my opinion, it could make sense to use ICU for things like localized
collation (sorting). It could also make sense to use ICU to look up Unicode
properties that have nothing to do with grapheme segmentation or casing.
This would be a lot of work, and if something like it were implemented, it
would probably happen as part of a larger rethinking of how we use Unicode.
Everything is complicated, though, by the fact that we support many
complicated regular expressions over different Unicode properties. I would
start by benchmarking ICU against the current implementation.
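
For reference, the built-in lookups that such a benchmark would be measuring are exposed in Raku as `uniname`, `uniprop`, and `<:Property>` regex assertions. The following is only a sketch of a baseline: the sample strings and the repeat count are arbitrary choices, and the ICU side is left out because no Raku binding for it is assumed here.

```raku
# Look up a few properties per character with the built-in routines.
for "Aé1".comb -> $ch {
    say join "\t", $ch, $ch.uniname,
        $ch.uniprop('General_Category'), $ch.uniprop('Script');
}

# Property assertions inside a regex: collect runs of uppercase letters.
say "Héllo Wörld".comb(/<:Lu>+/);

# Rough baseline timing for per-character lookups over a long string.
my $long    = "aé中" x 100_000;
my $start   = now;
my $letters = 0;
$letters++ if .uniprop('General_Category').starts-with('L') for $long.comb;
say "$letters letters in {now - $start} seconds";
```

A fair comparison would have to run the same lookups through ICU on identical input, including the cost of crossing the native-call boundary for every codepoint.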


Re: readchars, seek back, and readchars again

2020-04-28 Thread Samantha McVey
On Saturday, 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> > strings:
> >  https://github.com/rakudo/rakudo/issues/3461
> >
> > I know it might be far-fetched, but what if your UTF-8 issue and
> > Yary's UTF-16 issue were related
> 
> Well, an issue with handling combining characters could easily affect
> both, nothing about it is specific to one encoding. Yary's issue
> doesn't have to do with reading from disk though, he's just looking at
> the raw bytes the encoding generates.
> 
> On 4/24/20, William Michels  wrote:
> > Hi Joe,
> > 
> > I was able to run the code you posted and reproduced the exact same
> > result (Rakudo version 2020.02.1..1 built on MoarVM version
> > 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a
> > bit (e.g. UTF8-C8), but I didn't see any improvement.
> > 
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> > strings:
> > 
> > https://github.com/rakudo/rakudo/issues/3461
> > 
> > I know it might be far-fetched, but what if your UTF-8 issue and
> > Yary's UTF-16 issue were related? It would be nice to kill two birds
> > with one stone.
> > 
> > Best Regards, Bill.
> > 
> > On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner  wrote:
> >> Another version of my test code, checking .tell throughout:
> >> 
> >> use v6;
> >> use Test;
> >> 
> >> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> my $file = "$tmpdir/scratch_file.txt";
> >> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
> >> my $ascii_str =   "ABCDEFGHI";
> >> 
> >> test_read_and_read_again($unichar_str, $file, 3);
> >> test_read_and_read_again($ascii_str,   $file, 0);
> >> 
> >> # write given string to file, then read the third character twice and check
> >> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >>     spurt $file, $str;
> >>     my $fh = $file.IO.open;
> >>     printf "%d: just opened\n", $fh.tell;
> >>     $fh.readchars(2);  # skip a few
> >>     printf "%d: after skipping 2\n", $fh.tell;
> >>     my $chr_1 =  $fh.readchars(1);
> >>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
> >>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >>     my $step_back = $width + $nudge;
> >>     $fh.seek: -$step_back, SeekFromCurrent;
> >>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
> >>     my $chr_2 =  $fh.readchars(1);
> >>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
> >>     is( $chr_1, $chr_2,
> >>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> }
> >> 
> >> 
> >> The output looks like so:
> >> 
> >> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> >> 0: just opened
> >> 9: after skipping 2
> >> 12: after reading 3rd: ䷼
> >> 6: after seeking back 6
> >> 12: after re-reading 3rd: ䷼
> >> ok 1 - read, seek back, and read again gets same char with nudge of 3
> >> 0: just opened
> >> 2: after skipping 2
> >> 3: after reading 3rd: C
> >> 2: after seeking back 1
> >> 3: after re-reading 3rd: C
> >> ok 2 - read, seek back, and read again gets same char with nudge of 0
> >> 
> >> It's really hard to see what I should do if I really wanted to
> >> intermix readchars and seeks like this... I'd need to check the range
> >> of the codepoint to see how far I need to seek to get where I expect
> >> to be.
> >> 
> >> On 4/24/20, Joseph Brenner  wrote:
> >> > Thanks, yes I understand unicode and utf-8 reasonably well.
> >> > 
> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> combining codepoint.
> >> >> 
> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> but failing to do that for wider codepoints.
> >> > 
> >> > I think it'd be the other way around... the idea here would be it's
> >> > doing an extra readchar behind the scenes just in case there's
> >> > combining chars involved-- so you're figuring there's some confusion
> >> > about the actual point in the file that's being read and the
> >> > abstraction that readchars is supplying?
> >> > 
> >> > On 4/24/20, Brad Gilbert  wrote:
> >> >> In UTF-8, a codepoint can be 1 to 4 bytes long.
> >> >> 
> >> >> UTF-8 was designed so that 7-bit ASCII is a subset of it.
> >> >> 
> >> >> Any 8-bit byte that has its most significant bit set cannot be ASCII.
> >> >> So multi-byte codepoints have the most significant bit set for all of
> >> >> the bytes.
> >> >> The first byte can tell you the number of bytes that follow it.
> >> >> 
> >> >> That is how a single codepoint is stored.
> >> >> 
> >> >> A character can be made of several codepoints.
> >> >> 
> >> >> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> >> >> "é"
> >> >> 
> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> combining codepoint.
> >> >> 
> >> >> 
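
The distinction drawn above between bytes, codepoints, and characters (graphemes) can be checked directly in Raku. This is a small sketch using only built-in string methods; the sample codepoints are arbitrary picks, apart from U+4DFC, which is the third character of the test string in this thread.

```raku
# One user-visible character built from two codepoints.
my $s = "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]";
say $s.chars;                  # number of graphemes
say $s.NFD.codes;              # number of codepoints when fully decomposed
say $s.encode('utf8').bytes;   # number of bytes the UTF-8 encoder emits

# UTF-8 width depends on the codepoint: 1 byte for ASCII, up to 4 beyond it.
for "A", "\x[E9]", "\x[4DFC]", "\x[1F600]" -> $c {
    printf "U+%04X takes %d byte(s) in UTF-8\n", $c.ord, $c.encode('utf8').bytes;
}
```

Because a following codepoint may still combine with the one just decoded, a reader that returns whole graphemes cannot know a character is complete until it has looked at what comes next.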

Re: readchars, seek back, and readchars again

2020-04-28 Thread Samantha McVey
On Monday, 27 April 2020 09:49:20 CEST Joseph Brenner wrote:
> After you do a .readchars, what point in the file would you expect to
> be "current"?  I would expect it would be the point right after the
> last char read.  Instead, that's true only if you're reading ASCII
> characters; for Unicode characters above the ASCII range, the
> "current" point is further along than that (in the cases I've
> looked at, by 3 bytes).
> 
> If you try to intermix readchars with calls to .seek using the
> "SeekFromCurrent" feature, it can be tricky to predict where you're
> going to end up, because the point you're starting at depends on what
> kind of text you've been reading, not just the number of bytes you've
> read.
> 
> Is that making any sense?  I posted a later code example that might
> show the problem more clearly...
> 
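
One way to sidestep the uncertainty about the "current" point is to keep your own byte offset and seek to absolute positions with `SeekFromBeginning`, so that `.tell` is never consulted. This is a sketch, not a confirmed fix: it assumes the file is UTF-8 and already normalized, so that re-encoding the characters read gives back the bytes on disk, and it assumes that an absolute seek discards any read-ahead, which the output earlier in this thread suggests. The scratch file name is a hypothetical choice.

```raku
# Hypothetical scratch file for this sketch.
my $file = $*TMPDIR.add('readchars_seek_sketch.txt');
$file.spurt("\x[1200]\x[2D80]\x[4DFC]\x[2CA4]\x[2C8E]");

my $fh     = $file.open;   # read mode, UTF-8 by default
my $offset = 0;            # our own byte offset into the file

# Skip two characters and record how many bytes they occupy on disk.
my $skipped = $fh.readchars(2);
$offset += $skipped.encode('utf8').bytes;

my $first = $fh.readchars(1);

# Return to the start of the third character by absolute position.
$fh.seek($offset, SeekFromBeginning);
my $second = $fh.readchars(1);

say $first eq $second ?? "same character" !! "different characters";

$fh.close;
$file.unlink;
```

The trade-off is that every read has to be re-encoded to measure it, which costs time on large reads and breaks down if the file's bytes are not what the encoder would produce for the same text.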