Re: The .bytes/.codepoints/.graphemes methods

2004-07-13 Thread David Green
In article [EMAIL PROTECTED], [EMAIL PROTECTED] (Larry Wall) wrote: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : :u0 # use bytes (. is byte) : :u1 # level 1 support (. is codepoint) : :u2 # level 2 support (. is

RE: The .bytes/.codepoints/.graphemes methods

2004-07-12 Thread Austin Hastings
-----Original Message----- From: Jonadab the Unsightly One [mailto:[EMAIL PROTECTED]] Austin Hastings [EMAIL PROTECTED] writes: I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as mark insensitivity. Makes sense to me, but... Maybe it can just

Re: The .bytes/.codepoints/.graphemes methods

2004-07-12 Thread Jonadab the Unsightly One
Luke Palmer [EMAIL PROTECTED] writes: Or, god forbid, a word? m:base/que mas/ We're not mathematicians: we're allowed to use more than one letter in a row to designate something :-) Well, if it were *me*, *I* would have voted for keeping the core language 100% pure ASCII, untainted by

Re: The .bytes/.codepoints/.graphemes methods

2004-07-10 Thread Jonadab the Unsightly One
Austin Hastings [EMAIL PROTECTED] writes: I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as mark insensitivity. Makes sense to me, but... Maybe it can just roll into :i? It will probably get used in _conjunction_ with case-insensitivity quite a

Re: The .bytes/.codepoints/.graphemes methods

2004-07-10 Thread Luke Palmer
Jonadab the Unsightly One writes: Austin Hastings [EMAIL PROTECTED] writes: I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as mark insensitivity. Makes sense to me, but... Maybe it can just roll into :i? It will probably get used in

Re: The .bytes/.codepoints/.graphemes methods

2004-07-08 Thread Austin Hastings
--- Larry Wall [EMAIL PROTECTED] wrote: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : Or was that to imply that a literal a in the RE would be : interpreted as a grapheme a when :u2 is active? I don't know what you mean by grapheme a there. If you mean, Does it

Re: The .bytes/.codepoints/.graphemes methods

2004-07-07 Thread Larry Wall
On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote: : This has no direct bearing on p6l, since performance is a p6i issue. : But perhaps in the interests of performance as well as hackery we : should explicitly

Re: The .bytes/.codepoints/.graphemes methods

2004-07-07 Thread Larry Wall
On Wed, Jul 07, 2004 at 08:09:51PM -0700, Larry Wall wrote: : On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote: : : This has no direct bearing on p6l, since performance is a p6i issue. : : But perhaps in the

Re: The .bytes/.codepoints/.graphemes methods

2004-07-03 Thread Brent 'Dax' Royal-Gordon
Aaron Sherman wrote: On Tue, 2004-06-29 at 11:34, Austin Hastings wrote: (2) Perl6 should equitably support all its target locales; (3) we should set out to make sure the performance is damn fast no matter what locale we're using. Well, that's a nice theory, but you can prove that low-level

Re: The .bytes/.codepoints/.graphemes methods

2004-07-02 Thread Aaron Sherman
On Tue, 2004-06-29 at 11:34, Austin Hastings wrote: [...] when you switch to LC_ALL= pick your favorite language, you just get really slow performance: Apparently the 'C' locale is such a totally special case that the performance of LC_ALL=C is one or more orders of magnitude better than

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Matt Diephouse
Larry Wall wrote: On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote: : Issues: : * Limits lvalue substr (doesn't allow it to be a different size) : unless splice is used (or a substr method is also provided). That all has to be looked at anyway. What does 5 mean when

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Juerd
Matt Diephouse skribis 2004-06-30 20:51 (-0400): my $string = "Hello, World!"; say $string[0..4]; # prints "Hello\n" $string[7...] = "Larry!"; say $string; # prints "Hello, Larry!\n" And that array is one of bytes? graphemes? In general, I like the idea. In [EMAIL PROTECTED], almost the same was

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Matt Diephouse
Juerd wrote: Matt Diephouse skribis 2004-06-30 20:51 (-0400): my $string = "Hello, World!"; say $string[0..4]; # prints "Hello\n" $string[7...] = "Larry!"; say $string; # prints "Hello, Larry!\n" And that array is one of bytes? graphemes? I'm not really up on my Unicode, but I think .chars is what I have

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread John Williams
On Thu, 1 Jul 2004, Juerd wrote: Matt Diephouse skribis 2004-06-30 20:51 (-0400): my $string = "Hello, World!"; say $string[0..4]; # prints "Hello\n" $string[7...] = "Larry!"; say $string; # prints "Hello, Larry!\n" And that array is one of bytes? graphemes? In general, I like the idea.

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Dan Sugalski [EMAIL PROTECTED] writes: Hmm. Suppose that I have a system that is friendly to 80 byte records. I want to output meaningful strings, so I want to partition a buffer into 80-ish byte substrings, but preserve any graphemes (i.e., store the data in a legible format). How would I

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Austin Hastings [EMAIL PROTECTED] writes: A couple of alternatives: substr.bytes($string, 2, 4) = $substitute; Well, that's arguably better than bsubstr. substr($string.bytes, 2, 4) = $substitute; I could live with that, although it doesn't allow mixing units. (Someone will pop in here

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Austin Hastings
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote: Have the implications of the bytes/codepoints/graphemes/woohickies distinction for the regular expression engine been discussed already? Not enough. One of my current clients just rolled on to redhat 9, and what a steaming pile of

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Juerd [EMAIL PROTECTED] writes: substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. That could be combined with a smart substr that only needs the units once (err, only needs a position object for one of the args) and knows how to

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonathan Scott Duff
On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote: This has no direct bearing on p6l, since performance is a p6i issue. But perhaps in the interests of performance as well as hackery we should explicitly provide some sort of variant regex behavior: /a./ :bytes /a./

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Jonadab the Unsightly One
Larry Wall [EMAIL PROTECTED] writes: That all has to be looked at anyway. What does 5 mean when you pass it to substr, anyway? I was just going to ask about substrings, and then didn't because I figured that had been hashed out already and I'd missed it... (I've been trying to make it

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Larry Wall
On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote: : You could coin the abbreviation ligs, for Language Independent : Graphemes. Then some ingenious rascal can create a pragma or whatever : that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness. Except

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dave Whipp
Jonadab The Unsightly One [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] It would be possible to have right-associative operators (that bind at least more tightly than comma and possibly very tightly) and convert a number to one of these objects, so that we can do stuff like this:

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Larry Wall wrote: On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote: : You could coin the abbreviation ligs, for Language Independent : Graphemes. Then some ingenious rascal can create a pragma or whatever : that allows $str.b, $str.c, $str.g, and

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Juerd
Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. Juerd

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. I think mixing

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Austin Hastings
--- Dan Sugalski [EMAIL PROTECTED] wrote: On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Austin Hastings wrote: --- Dan Sugalski [EMAIL PROTECTED] wrote: On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Austin Hastings
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote: Larry Wall [EMAIL PROTECTED] writes: (I've been trying to make it assume some implicit unit based on the current lexical scope's Unicode level, but issues remain.) We have magical string positions that have different numeric values