Re: The .bytes/.codepoints/.graphemes methods

2004-07-13 Thread David Green
In article [EMAIL PROTECTED],
 [EMAIL PROTECTED] (Larry Wall) wrote:

On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
:  :u0 # use bytes       (. is byte)
:  :u1 # level 1 support (. is codepoint)
:  :u2 # level 2 support (. is grapheme)
:  :u3 # level 3 support (. is language dependent)

These modifiers might get renamed to match whatever b/c/g/w convention
we come up with pragmas.  The levels aren't all that intuitive, though
there is a kind of progression of semantic complexity that would get
lost with ordinary names.

      bytes
    codepts
  graphemes
langdepends

That's a kind of progression.  And codepts seems a natural enough
abbreviation, though I don't really know what to do with
language_dependent_thingummies.  Though with less typing, the initials
b < c < g < l give the same progression.


 -David "except for encodings where c<b, of course" Green


RE: The .bytes/.codepoints/.graphemes methods

2004-07-12 Thread Austin Hastings


 -Original Message-
 From: Jonadab the Unsightly One [mailto:[EMAIL PROTECTED]
 Austin Hastings [EMAIL PROTECTED] writes:

  I think this is something that we'll want as a mode, a la
  case-insensitivity. Think of it as mark insensitivity.

 Makes sense to me, but...

  Maybe it can just roll into :i?

 It will probably get used in _conjunction_ with
 case-insensitivity quite a lot, but I suspect people will want
to be able
 to use one without the other.

 Since mark-insensitivity is probably mostly a non-issue
 in the ASCII world, it would probably be a better candidate than
 average for being turned on using a unicode character, if we're
running
 low on letters for designating these rules.

How about :i ?

:) :) :)

=Austin



Re: The .bytes/.codepoints/.graphemes methods

2004-07-12 Thread Jonadab the Unsightly One
Luke Palmer [EMAIL PROTECTED] writes:

 Or, god forbid, a word?

 m:base/que mas/

 We're not mathematicians: we're allowed to use more than one letter
 in a row to designate something :-)

Well, if it were *me*, *I* would have voted for keeping the core
language 100% pure ASCII, untainted by rogue untypeable characters...
So naturally :base is fine by *me*...

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}}
split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/



Re: The .bytes/.codepoints/.graphemes methods

2004-07-10 Thread Jonadab the Unsightly One
Austin Hastings [EMAIL PROTECTED] writes:

 I think this is something that we'll want as a mode, a la
 case-insensitivity. Think of it as mark insensitivity.

Makes sense to me, but...

 Maybe it can just roll into :i?

It will probably get used in _conjunction_ with case-insensitivity
quite a lot, but I suspect people will want to be able to use one
without the other.

Since mark-insensitivity is probably mostly a non-issue in the ASCII
world, it would probably be a better candidate than average for being
turned on using a unicode character, if we're running low on letters
for designating these rules.




Re: The .bytes/.codepoints/.graphemes methods

2004-07-10 Thread Luke Palmer
Jonadab the Unsightly One writes:
 Austin Hastings [EMAIL PROTECTED] writes:
 
  I think this is something that we'll want as a mode, a la
  case-insensitivity. Think of it as mark insensitivity.
 
 Makes sense to me, but...
 
  Maybe it can just roll into :i?
 
 It will probably get used in _conjunction_ with case-insensitivity
 quite a lot, but I suspect people will want to be able to use one
 without the other.
 
 Since mark-insensitivity is probably mostly a non-issue in the ASCII
 world, it would probably be a better candidate than average for being
 turned on using a unicode character, if we're running low on letters
 for designating these rules.

Or, god forbid, a word?

m:base/que mas/

We're not mathematicians: we're allowed to use more than one letter in a
row to designate something :-)

Luke


Re: The .bytes/.codepoints/.graphemes methods

2004-07-08 Thread Austin Hastings
--- Larry Wall [EMAIL PROTECTED] wrote:
 On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
 
 : Or was that to imply that a literal "a" in the RE would be
 : interpreted as a grapheme "a" when :u2 is active?
 
 I don't know what you mean by grapheme a there.  If you mean, Does
 it match any grapheme that happens to be exactly U+0061?, then the
 answer is yes.  

In my original question, I meant to differentiate between 'grapheme'
and 'possible component of a multibyte expression'.

 If you mean Does it wildcard to any grapheme that uses
 U+0061 as the base character?, then the answer is probably no.  We
 have not yet come up with a syntax for that kind of wildcarding,
 other than dropping down to codepoints [:u1 a \pM+] or some such. 
 That may or may not be sufficient.  It'd be pretty easy to define a 
 like a assertion in any case.

I think this is something that we'll want as a mode, a la
case-insensitivity. Think of it as mark insensitivity.

I'm not sure if this should be language/locale dependent or not, but a
basic search feature for text is fre'd -> fred.

Maybe it can just roll into :i?

=Austin
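
A rough illustration of the mark-insensitive mode described above, as a
Python sketch rather than Perl 6 rule syntax. The names strip_marks and
mark_insensitive_equal are made up for this example, and the accented test
strings are assumptions. The approach (decompose to NFD, then drop every
codepoint in a Unicode Mark category) is one common way to get this
behavior, not something the thread settled on.

```python
import unicodedata


def strip_marks(text):
    """Decompose text (NFD) and drop all combining marks (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def mark_insensitive_equal(a, b, ignore_case=False):
    """Compare two strings while ignoring marks; case is ignored only on request."""
    a, b = strip_marks(a), strip_marks(b)
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    return a == b


print(mark_insensitive_equal("fr\u00e9d", "fred"))                    # marks ignored
print(mark_insensitive_equal("Fr\u00e9d", "fred"))                    # case still matters
print(mark_insensitive_equal("Fr\u00e9d", "fred", ignore_case=True))  # both ignored
```

Keeping ignore_case as a separate switch reflects the point raised in the
replies: mark-insensitivity and case-insensitivity will often be used
together, but each should be usable without the other.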



Re: The .bytes/.codepoints/.graphemes methods

2004-07-07 Thread Larry Wall
On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
:  This has no direct bearing on p6l, since performance is a p6i issue.
:  But perhaps in the interests of performance as well as hackery we
:  should explicitly provide some sort of variant regex behavior:
:  
:  /a./ :bytes
:  /a./ :graphemes
:  
:  where the first would recognize 0x61 followed by any single byte, while
:  the second would recognize 'a' followed by any number of bytes
:  composing a single grapheme.
: 
: Isn't that what :u0, :u1, :u2, and :u3 are for?
: 
:   :u0 # use bytes       (. is byte)
:   :u1 # level 1 support (. is codepoint)
:   :u2 # level 2 support (. is grapheme)
:   :u3 # level 3 support (. is language dependent)

These modifiers might get renamed to match whatever b/c/g/w convention
we come up with pragmas.  The levels aren't all that intuitive, though
there is a kind of progression of semantic complexity that would get
lost with ordinary names.

: These modifiers say nothing about the state of the data, but in
: general internal Perl data will already be in Normalization Form
: C, so even under :u1, the precomposed characters will usually do
: the right thing.

These days it might be that most of the data we see will be maximally
decomposed rather than maximally composed.  But the jury is still out
on that.  And in any event, :u2 and :u3 should hide that distinction.
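
To make the composed/decomposed point concrete, here is a small Python
sketch (not Perl 6) that counts the same word in three of the units under
discussion. The helper names are invented for this example, and
count_graphemes is only an approximation: it treats every codepoint with a
non-zero canonical combining class as part of the preceding grapheme, which
is far simpler than full Unicode grapheme segmentation.

```python
import unicodedata


def count_graphemes(text):
    """Approximate grapheme count: count codepoints that are not combining."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def unit_counts(text):
    """Length of text measured in bytes (UTF-8), codepoints, and graphemes."""
    return {
        "bytes": len(text.encode("utf-8")),
        "codepoints": len(text),
        "graphemes": count_graphemes(text),
    }


# The same word, maximally composed (NFC) and maximally decomposed (NFD).
composed = unicodedata.normalize("NFC", "re\u0301sume\u0301")
decomposed = unicodedata.normalize("NFD", composed)

print(unit_counts(composed))
print(unit_counts(decomposed))
```

The byte and codepoint counts differ between the two forms (8 and 6 for the
composed string, 10 and 8 for the decomposed one), while the grapheme count
is 6 for both. That is the sense in which a grapheme-level view hides the
normalization form.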

: Note that these modifiers are for overriding
: the default support level, which was probably set by pragma at
: the top of the file.

Another way of saying that is that these modifiers are, in fact,
lexically scoped pragmas with the *exact* same effect as the ordinary
Unicode level pragmas.  It's just that they're lexically scoped to
the rest of a rule or group rather than to the rest of a block.

: Or was that to imply that a literal "a" in the RE would be
: interpreted as a grapheme "a" when :u2 is active?

I don't know what you mean by "grapheme a" there.  If you mean, "Does
it match any grapheme that happens to be exactly U+0061?", then the
answer is yes.  If you mean "Does it wildcard to any grapheme that uses
U+0061 as the base character?", then the answer is probably no.  We
have not yet come up with a syntax for that kind of wildcarding, other
than dropping down to codepoints [:u1 a \pM+] or some such.  That may
or may not be sufficient.  It'd be pretty easy to define a "like a"
assertion in any case.

Larry
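
For readers who want to see what the codepoint-level pattern
[:u1 a \pM+] amounts to, here is a Python sketch of the same idea: find
every grapheme whose base character is a given letter and which carries at
least one mark. The function names are hypothetical, Python's re module has
no \pM class so the scan is done by hand, and the clustering is the simple
base-plus-marks approximation, not full grapheme segmentation.

```python
import unicodedata


def clusters(text):
    """Group each base codepoint with the marks that follow it (after NFD)."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def graphemes_with_base(text, base):
    """Return graphemes made of `base` plus one or more marks, recomposed to NFC."""
    return [
        unicodedata.normalize("NFC", c)
        for c in clusters(text)
        if c[0] == base and len(c) > 1
    ]


# A plain "a" is skipped (no marks); the accented forms are returned.
print(graphemes_with_base("a \u00e0 b \u00e4", "a"))
```

Dropping the len(c) > 1 test would give the looser "like a" behavior, where
the bare letter matches as well.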


Re: The .bytes/.codepoints/.graphemes methods

2004-07-07 Thread Larry Wall
On Wed, Jul 07, 2004 at 08:09:51PM -0700, Larry Wall wrote:
: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: :  This has no direct bearing on p6l, since performance is a p6i issue.
: :  But perhaps in the interests of performance as well as hackery we
: :  should explicitly provide some sort of variant regex behavior:
: :  
: :  /a./ :bytes
: :  /a./ :graphemes
: :  
: :  where the first would recognize 0x61 followed by any single byte, while
: :  the second would recognize 'a' followed by any number of bytes
: :  composing a single grapheme.
: : 
: : Isn't that what :u0, :u1, :u2, and :u3 are for?
: : 
: : :u0 # use bytes       (. is byte)
: : :u1 # level 1 support (. is codepoint)
: : :u2 # level 2 support (. is grapheme)
: : :u3 # level 3 support (. is language dependent)
: 
: These modifiers might get renamed to match whatever b/c/g/w convention
: we come up with pragmas.  The levels aren't all that intuitive, though
: there is a kind of progression of semantic complexity that would get
: lost with ordinary names.

On the flip side, a good reason to get rid of the numeric values is
that in all likelihood people will continually make the mistake of
thinking :u1 means one byte at a time and :u2 means two bytes at
a time.  And then they'll wonder why :u4 doesn't give them UTF-32...

Larry


Re: The .bytes/.codepoints/.graphemes methods

2004-07-03 Thread Brent 'Dax' Royal-Gordon
Aaron Sherman wrote:
> On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
>> (2) Perl6 should equitably support all its target
>> locales; (3) we should set out to make sure the performance is damn
>> fast no matter what locale we're using.
>
> Well, that's a nice theory, but you can prove that low-level encodings
> (e.g. ASCII, EBCDIC) will be more efficient than high-level encodings
> (e.g. UTF-8), so the only way to accomplish what you suggest in (2) is
> to break (3) by slowing down the faster handling (not what you wanted,
> I'm sure).

At the Parrot level, codepoint operations will generally be the most 
efficient, even on strings with exotic charsets.  Parrot uses an 
internal encoding that allows O(1) access to codepoints; essentially, it 
uses an array of 8-, 16-, or 32-bit integers, depending on the highest 
codepoint value.  This is the default even for character sets with shift 
characters, like Shift-JIS.

On strings where all codepoints have values under 256, bytewise and 
codepointwise lookup are equivalent; otherwise, though, bytewise lookup 
will actually be *slower* than codepointwise, as Parrot will maintain 
the illusion that each codepoint is stored in an integer that's the 
perfect size for it.
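
A minimal Python sketch of the width selection described above, under the
assumption that the description is accurate (the author flags it as
possibly misremembered). This is not Parrot code; choose_width and
to_fixed_width are names invented for the illustration.

```python
def choose_width(text):
    """Smallest of 8, 16 or 32 bits that can hold every codepoint in text."""
    highest = max((ord(ch) for ch in text), default=0)
    if highest < 0x100:
        return 8
    if highest < 0x10000:
        return 16
    return 32


def to_fixed_width(text):
    """Return (bits per unit, list of codepoint values); indexing the list is O(1)."""
    return choose_width(text), [ord(ch) for ch in text]


width, units = to_fixed_width("h\u00e9llo")
print(width, units[1] == 0xE9)   # codepoint 1 is found by direct indexing
```

Because every unit in a given string has the same width, the Nth codepoint
sits at a fixed offset. That is what makes codepoint lookup constant-time
here, at the cost of widening the whole string when a single high codepoint
appears.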

If you force Parrot to use the UTF-8 encoding internally then bytewise 
lookup becomes fastest, and codepointwise slows down a lot.  But you 
really shouldn't do that--UTF-8 is ill-suited for actually 
*manipulating* text, unlike the Parrot internal encodings.  (UTF-16 and 
UTF-32 will presumably be available too, although I've seen no specific 
mention of them.)

You can also force it to use a raw or bytes encoding, where bytes 
and codepoints are identical.  But you can't store Unicode characters in 
such a string and have them behave in a reasonable way.

(Note: this is all based on my own, possibly false, memory.)
--
Brent Dax Royal-Gordon [EMAIL PROTECTED]
Perl and Parrot hacker
Oceania has always been at war with Eastasia.


Re: The .bytes/.codepoints/.graphemes methods

2004-07-02 Thread Aaron Sherman
On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:

 [...] when you switch to LC_ALL=<pick your favorite
 language>, you just get really slow performance: Apparently the 'C'
 locale is such a totally special case that the performance of LC_ALL=C
 is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
 when the data is 7-bit ASCII.

Well, of course. I can't imagine a way in which this would not be true.

After all, in LC_ALL=C the number of characters in a string is equal
to the number of bytes in the string. In LC_ALL=en_US.UTF-8 the length
of a string is dependent on what exactly you mean by length, and a lot
of special cases arise. Special cases and context mean you have more
code to execute for the same logical task, which means you have more
processing to do.

Unicode support is expensive, even if you're just doing ASCII-as-UTF-8.
That doesn't mean it's a bad thing to do, it's just that it's expensive.

 I think that (1) this is unacceptable: the temptation to switch to the
 'C' locale has been too great, both at this site and on a lot of the RH
 support forums; 

And yet, in English-speaking countries (and Hawaiian and
Swahili-speaking countries for that matter) and in situations where the
fidelity of certain types of string data (such as names) is not
considered critical, this is a fine default. e.g. for general shell
work.

 (2) Perl6 should equitably support all its target
 locales; (3) we should set out to make sure the performance is damn
 fast no matter what locale we're using.

Well, that's a nice theory, but you can prove that low-level encodings
(e.g. ASCII, EBCDIC) will be more efficient than high-level encodings
(e.g. UTF-8), so the only way to accomplish what you suggest in (2) is
to break (3) by slowing down the faster handling (not what you wanted,
I'm sure).

Of course, you want to have as much performance out of string handling
as possible.

 This has no direct bearing on p6l, since performance is a p6i issue.
 But perhaps in the interests of performance as well as hackery we
 should explicitly provide some sort of variant regex behavior:
 
 /a./ :bytes
 /a./ :graphemes

As pointed out by others, this is already there, though I'm not sure
that it would be specified that way. More likely:

m :u0 /a./
[etc]

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Perl Toolsmith
http://www.ajs.com/~ajs/resume.html




Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Matt Diephouse
Larry Wall wrote:
> On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:
> : Issues:
> :   * Limits lvalue substr (doesn't allow it to be a different size)
> : unless splice is used (or a substr method is also provided).
>
> That all has to be looked at anyway.  What does 5 mean when you
> pass it to substr, anyway?  (I've been trying to make it assume some
> implicit unit based on the current lexical scope's Unicode level,
> but issues remain.)  We have magical string positions that have
> different numeric values depending on what units you view them as,
> but at what point does a number like 5 get translated to such
> a magical string position?

While we're on the topic of substr, allow me to beg. Please, can we 
replace substr with array style operations like Ruby and Python? 
Please? Something like this would be nice:

 my $string = "Hello, World!";
 say $string[0..4]; # prints "Hello\n"
 $string[7...] = "Larry!";
 say $string; # prints "Hello, Larry!\n"

We already have our strings acting as objects, and we have [] as a 
postcircumfix operator, so it's something that someone could define 
easily. Of course, I have no idea how to reconcile this with all the 
talk of unicode other than to say that the easy stuff should be easy.

It just follows this would also be nice for arrays, to replace splice. 
For me, these two functions are the most bothersome part of Perl 5, and 
I would love to see them go.

matt


Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Juerd
Matt Diephouse skribis 2004-06-30 20:51 (-0400):
  my $string = "Hello, World!";
  say $string[0..4]; # prints "Hello\n"
  $string[7...] = "Larry!";
  say $string; # prints "Hello, Larry!\n"

And that array is one of bytes? graphemes?

In general, I like the idea. In [EMAIL PROTECTED], almost
the same was suggested, but implemented differently: a string's .bytes
method in list context (but isn't it array context, technically?) would
dwym. As would the other parts-of-string methods.

Perhaps without method, the string in array/list context can default to
the default set by a lexical pragma. Which, I hope, has a default
itself. (I like default defaults...)


Juerd


Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Matt Diephouse
Juerd wrote:
> Matt Diephouse skribis 2004-06-30 20:51 (-0400):
>> my $string = "Hello, World!";
>> say $string[0..4]; # prints "Hello\n"
>> $string[7...] = "Larry!";
>> say $string; # prints "Hello, Larry!\n"
>
> And that array is one of bytes? graphemes?

I'm not really up on my unicode, but I think .chars is what I have in 
mind. I want it to operate like a non-unicode string in Perl 5. Anything 
unicode can be more complex, as I think this will be the common case.

> In general, I like the idea. In [EMAIL PROTECTED], almost
> the same was suggested, but implemented differently: a string's .bytes
> method in list context (but isn't it array context, technically?) would
> dwym. As would the other parts-of-string methods.

Think of this as Huffmanized .chars then?

matt


Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread John Williams
On Thu, 1 Jul 2004, Juerd wrote:

 Matt Diephouse skribis 2004-06-30 20:51 (-0400):
   my $string = "Hello, World!";
   say $string[0..4]; # prints "Hello\n"
   $string[7...] = "Larry!";
   say $string; # prints "Hello, Larry!\n"

 And that array is one of bytes? graphemes?

 In general, I like the idea. In [EMAIL PROTECTED], almost
 the same was suggested, but implemented differently: a string's .bytes
 method in list context (but isn't it array context, technically?) would
 dwym. As would the other parts-of-string methods.

What if you could add the slice onto the method:

  my $string = "Hello, World!";
  say $string.bytes[0..4]; # prints "Hello\n"
  $string.codepoints[7...] = "Søren!";
  say $string; # prints "Hello, Søren!\n"

The string slicing operator would have to return an array of
bytes/codepoints/etc in list context and a substr in scalar context.
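
The difference between slicing the byte view and slicing the codepoint view
can be tried out in Python, whose bytes and str types happen to correspond
to those two views. This is only an analogy to the proposed .bytes and
.codepoints methods; decodes_cleanly is a helper invented for the example.

```python
text = "Hello, World!"

# Byte view: slice the UTF-8 encoding. Perl's inclusive 0..4 is Python's [0:5].
first_bytes = text.encode("utf-8")[0:5]

# Codepoint view: Python str indexes by codepoint, so replace from index 7 on.
replaced = text[:7] + "S\u00f8ren!"

# A byte slice can cut through a multi-byte character.
broken = "S\u00f8ren".encode("utf-8")[0:2]   # "S" plus the first byte of U+00F8


def decodes_cleanly(raw):
    """True if raw is a complete, valid UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(first_bytes, replaced, decodes_cleanly(broken))
```

The last value is False: a slice taken in bytes is not guaranteed to be a
valid string, which is the hazard any byte-indexed slice assignment would
have to deal with.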

~ John Williams




Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Dan Sugalski [EMAIL PROTECTED] writes:

 Hmm. Suppose that I have a system that is friendly to 80 byte
 records.  I want to output meaningful strings, so I want to
 partition a buffer into 80-ish byte substrings, but preserve any
 graphemes (i.e., store the data in a legible format).

 How would I do that?

 You don't. Or if you do, you do it with a lot of pain, sweat, and
 annoying hard work. 80 bytes gets you somewhere between three (And
 this may be a *high* estimate--there may be circumstances where 80
 bytes is insufficient for *one* grapheme) and 80 graphemes.

 This isn't something that can be made generically easy.

It's no worse than implementing word wrap.  Someone will of course
implement it as a generic routine, something along the lines of

my @line = breakunicodestringintobytebufferchunks(
   string => $string,
   chunksize => 80,
   keeptogether => 'graphemes',
   extremelongparts => 'split',
# 'split' will try to split it at a mostly-reasonable
#   place if possible, similar to word wrap that looks
#   for syllable boundaries.
# 'truncate' would do the same but drop the second part,
#   rather than putting it in the next line.
# 'skip' would drop the whole grapheme out.
# 'allow' would create a line longer (in bytes) than
#   the chunksize, which is what a lot of word wrap
#   algorithms do, but would not work if you really
#   have to fit in a fixed-byte-size buffer.  It would
#   of course put the thing on a line by itself though,
#   to minimize the overflow.
   );

There are reasons for doing this, e.g. if you've got Unicode text to
send via a network protocol with an octet-oriented RFC, or if you're
interacting with some legacy C code that has fixed-size buffers.
Someone will write the routine to do as well as can be expected, and
it'll be put on the CPAN, and people who need this sort of thing will
use it.
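
A bare-bones Python version of such a routine might look like the sketch
below. It is an assumption-laden illustration, not the CPAN module imagined
above: the names are invented, graphemes are approximated as a base
codepoint plus following marks, the encoding is assumed to be UTF-8, and
the only policy implemented for a grapheme that cannot fit is to raise an
error (none of 'split', 'truncate', 'skip' or 'allow').

```python
import unicodedata


def clusters(text):
    """Approximate graphemes: each base codepoint plus the marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def chunk_bytes(text, chunk_size=80, encoding="utf-8"):
    """Split text into byte strings of at most chunk_size bytes, never splitting a cluster."""
    chunks, current = [], b""
    for cluster in clusters(text):
        encoded = cluster.encode(encoding)
        if len(encoded) > chunk_size:
            # The 'extremelongparts' question; this sketch simply refuses.
            raise ValueError("cluster does not fit in one chunk")
        if len(current) + len(encoded) > chunk_size:
            chunks.append(current)
            current = b""
        current += encoded
    if current:
        chunks.append(current)
    return chunks


# Three decomposed "e + acute" graphemes, 3 bytes each, into 7-byte buffers.
print(chunk_bytes("e\u0301" * 3, chunk_size=7))
```

Each chunk decodes on its own, and joining the chunks gives back the
original bytes, so the receiving side can reassemble the text without
knowing where the boundaries fell.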

I don't think the language needs to be designed around it though.




Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Austin Hastings [EMAIL PROTECTED] writes:

 A couple of alternatives:

   substr.bytes($string, 2, 4) = $substitute;

Well, that's arguably better than bsubstr.

   substr($string.bytes, 2, 4) = $substitute;

I could live with that, although it doesn't allow mixing units.
(Someone will pop in here and say that's to be construed as a
feature.)

   # Make it a pragma
   use String(bytes); 
   substr($string, 2, 4) = $substitute;

I think a pragma should set the default unit for the current lexical
scope, at least.  (The default, in the absence of the pragma, is an
open question; at worst the default could be to throw an exception if
units aren't specified; personally I think throwing exceptions
willy-nilly is unPerlish.)

   # Make it a global mode
   set_string_mode(bytes);
   substr($string, 2, 4) = $substitute;

I don't like this.  It's no more useful than the pragma but has bigger
caveats.

   # Make it an object mode
   $string.access_mode(bytes);
   substr($string, 2, 4) = $substitute;

Wouldn't this add extra operations all over the place?

 The word bytes is clearly much too long, though, much less
 graphemes or codepoints.  I thought about this:
 
 substr($string, 2b, 4b) = $substitute;

 Problems with:
  
   substr($string, 0b, 1b) = $substitute;

 Is that binary or bytes? Also:

I figured it would conflict with something.

   substr($string, $start b, $end b) = $substitute;

 Looks unintuitive.

*shrug*.  I chose it because I thought the other way around looked
unintuitive:
substr($string, b $start, b $end) = $substitute;

That looks like calling a function -- which *is* what's going on,
under the hood, but the other way around looks like tagging on units,
which seems more natural to me.

 With presumably g and c for graphemes and codepoints, but I rather
 suspect that might conflict with some other existing syntax (though I
 can't think of anything in particular).

 0c? 0x16c ?

Ick, yes, I missed that.  (I was thinking only of numbers specified in
decimal.)  I knew there'd be something.

 codes and graphs is better than codepoints and graphemes, at least.

 In certain (IMO large) sectors of the Perl community, string
 processing is just about all the work there is. I submit that there
 needs to be a way to drive the token length to 0: either a pragma,
 or a global mode, or a type definition.

A pragma should set the default, IMO.  I think what we're talking
about here is what the syntax would be for using a unit other than the
default, or for specifying the units if you haven't used the pragma to
set the default.

 You could coin the abbreviation ligs, for Language Independent
 Graphemes.  Then some ingenious rascal can create a pragma or
 whatever that allows $str.b, $str.c, $str.g, and $str.l for 
 fans of terseness.

 As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

I thought about that, but figured it wasn't a big deal; there are
*lots* of abbreviations with more than one possible interpretation,
and you just deal with having to know which one is meant.  However, it
was then pointed out that it would actually be ldgs, which IMO is
unpronounceable and ugly.  So something else is needed for those.

*shrug*.  Make up a word.  Call them woohickies for all I care and
abbreviate it woo or just w.

 I like graphemes for the default because I hate and fear
 graphemes. The whole *code thing just crawls right in my ear, so
 having the language transparently support it would be a win.

I can see the logic in that.  Personally I don't care what the default
is.  Almost none of my code will need to care one way or the other,
and that which does can use the pragma.

Have the implications of the bytes/codepoints/graphemes/woohickies
distinction for the regular expression engine been discussed already?




Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Austin Hastings
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote:
 
 Have the implications of the bytes/codepoints/graphemes/woohickies
 distinction for the regular expression engine been discussed already?

Not enough.

One of my current clients just rolled on to redhat 9, and what a
steaming pile of digestive byproducts *that* turned out to be.
Apparently the default locale setting changed, so now LC_ALL= out of
the box.

One effect of this is an irritating lack of proper behavior in the
utilities. But when you switch to LC_ALL=<pick your favorite
language>, you just get really slow performance: Apparently the 'C'
locale is such a totally special case that the performance of LC_ALL=C
is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
when the data is 7-bit ASCII.

I think that (1) this is unacceptable: the temptation to switch to the
'C' locale has been too great, both at this site and on a lot of the RH
support forums; (2) Perl6 should equitably support all its target
locales; (3) we should set out to make sure the performance is damn
fast no matter what locale we're using.

This has no direct bearing on p6l, since performance is a p6i issue.
But perhaps in the interests of performance as well as hackery we
should explicitly provide some sort of variant regex behavior:

/a./ :bytes
/a./ :graphemes

where the first would recognize 0x61 followed by any single byte, while
the second would recognize 'a' followed by any number of bytes
composing a single grapheme.

(I'll claim that it's legitimate to want to search for, say, any MBCs
introduced via \x0F\x01, regardless of length. This is likely not
supported any other way.)

=Austin



Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Juerd [EMAIL PROTECTED] writes:

 substr($string, 2 but graphemes, 4 but bytes);

 I think "but" even makes sense, if substr defaults to something.

That could be combined with a smart substr that only needs the units
once (err, only needs a position object for one of the args) and knows
how to convert the other number to the same units (err, same type of
position object):

substr($string, 2, 4 but bytes);

This would still allow for specifying units on both if you for some
reason wanted them different (which, as Dan S points out, sounds like
a bad idea, on the face of it).

:bytes is shorter than "but bytes", though.
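
To show how a single unit could govern both arguments, with a scope-wide
default as the fallback, here is a Python sketch. Everything in it is
hypothetical: the substr signature, the unit names and the default_unit
parameter (standing in for a lexical pragma) are inventions for the
example, and graphemes are again approximated as base codepoint plus marks.

```python
import unicodedata


def clusters(text):
    """Approximate graphemes: each base codepoint plus the marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def substr(text, start, length, unit=None, default_unit=None):
    """Return `length` units of text from `start`, both measured in one unit."""
    unit = unit or default_unit
    if unit is None:
        raise ValueError("no unit given and no default in scope")
    if unit == "bytes":
        return text.encode("utf-8")[start:start + length]
    if unit == "codepoints":
        return text[start:start + length]
    if unit == "graphemes":
        return "".join(clusters(text)[start:start + length])
    raise ValueError(f"unknown unit: {unit}")


word = "cafe\u0301s"   # "e" followed by a combining acute accent
print(substr(word, 3, 1, unit="graphemes") == "e\u0301")   # accent stays attached
print(substr(word, 3, 1, unit="codepoints") == "e")        # accent is cut off
```

Taking one unit for the whole call sidesteps the mixed-unit case entirely.
Raising when no unit and no default is available is only one of the options
floated in the thread; assuming a unit and warning is the other.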




Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonathan Scott Duff
On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
 This has no direct bearing on p6l, since performance is a p6i issue.
 But perhaps in the interests of performance as well as hackery we
 should explicitly provide some sort of variant regex behavior:
 
 /a./ :bytes
 /a./ :graphemes
 
 where the first would recognize 0x61 followed by any single byte, while
 the second would recognize 'a' followed by any number of bytes
 composing a single grapheme.

Isn't that what :u0, :u1, :u2, and :u3 are for?

:u0 # use bytes       (. is byte)
:u1 # level 1 support (. is codepoint)
:u2 # level 2 support (. is grapheme)
:u3 # level 3 support (. is language dependent)

These modifiers say nothing about the state of the data, but in
general internal Perl data will already be in Normalization Form
C, so even under :u1, the precomposed characters will usually do
the right thing. Note that these modifiers are for overriding
the default support level, which was probably set by pragma at
the top of the file.

Or was that to imply that a literal "a" in the RE would be
interpreted as a grapheme "a" when :u2 is active?

-Scott
-- 
Jonathan Scott Duff Division of Nearshore Research
[EMAIL PROTECTED]   Senior Systems Analyst II


Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Jonadab the Unsightly One
Larry Wall [EMAIL PROTECTED] writes:

 That all has to be looked at anyway.  What does 5 mean when you
 pass it to substr, anyway?  

I was just going to ask about substrings, and then didn't because I
figured that had been hashed out already and I'd missed it...

 (I've been trying to make it assume some implicit unit based on the
 current lexical scope's Unicode level, but issues remain.)  We have
 magical string positions that have different numeric values
 depending on what units you view them as, but at what point does a
 number like 5 get translated to such a magical string position?

It would be possible to have right-associative operators (that bind at
least more tightly than comma and possibly very tightly) and convert a
number to one of these objects, so that we can do stuff like this:

substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume
something (possibly generating a warning) or spit an error, depending
on some feature of the current lexical scope.

The word "bytes" is clearly much too long, though, much less
"graphemes" or "codepoints".  I thought about this:

substr($string, 2b, 4b) = $substitute;

With presumably g and c for graphemes and codepoints, but I rather
suspect that might conflict with some other existing syntax (though I
can't think of anything in particular).

And I can't think of another abbreviation that would be remotely
intuitive.

There's also the possibility of bsubstr and so on, but that leads us
down the path of C, having a hillion bajillion functions with names
like fgets, stoi, and fstrnclost.  Having sprintf is quite enough of
that, IMO.

 I dunno--it reads pretty well.  Maybe these'll be heavily enough
 used that we should Huffmanize them down a bit:

 $str.bytes
 $str.codes
 $str.graphs
 $str.letters

codes and graphs is better than codepoints and graphemes, at least.

 Though letters is a bit inadequate to describe language-dependent
 graphemes, since it also divides any non-letters...I suppose we
 could go with .characters if we don't mind forcing a heavily
 overloaded word in one particular direction, culturally speaking.
 Except, I'd kinda like to keep them starting with different letters.
 (And maybe .chars should be reserved to mean whatever the default
 unit is in the current lexical scope, as with substr() above.)
  
You could coin the abbreviation ligs, for Language Independent
Graphemes.  Then some ingenious rascal can create a pragma or whatever
that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.




Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Larry Wall
On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
: You could coin the abbreviation ligs, for Language Independent
: Graphemes.  Then some ingenious rascal can create a pragma or whatever
: that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

Except they'd have to be ldgs.  Graphemes are ligs in current parlance.

Larry


Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dave Whipp
Jonadab The Unsightly One [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 It would be possible to have right-associative operators (that bind at
 least more tightly than comma and possibly very tightly) and convert a
 number to one of these objects, so that we can do stuff like this:

 substr($string, 2 bytes, 4 bytes) = $substitute;

I think that the common case will use the same units for both the index and
the length. So perhaps:

  substr($string, 2, 4 :bytes)

would be more appropriate. Also, by only requiring us to write the unit
once, the need for ultra-short abbreviations is reduced.


Dave.




Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Larry Wall wrote:

 On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
 : You could coin the abbreviation ligs, for Language Independent
 : Graphemes.  Then some ingenious rascal can create a pragma or whatever
 : that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

 Except they'd have to be ldgs.  Graphemes are ligs in current parlance.

And 'ligs' implies ligatures. And since that'd require font, style, and
possibly layout information, I think we'd rather not go there right now...

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk



Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Juerd
Dave Whipp skribis 2004-06-28  9:55 (-0700):
  substr($string, 2 bytes, 4 bytes) = $substitute;
 substr($string, 2, 4 :bytes)

substr($string, 2 but graphemes, 4 but bytes);

I think "but" even makes sense, if substr defaults to something.


Juerd


Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Juerd wrote:

 Dave Whipp skribis 2004-06-28  9:55 (-0700):
   substr($string, 2 bytes, 4 bytes) = $substitute;
  substr($string, 2, 4 :bytes)

 substr($string, 2 but graphemes, 4 but bytes);

 I think but even makes sense, if substr defaults to something.

I think mixing strings, bytes, graphemes, and code points together is a
phenomenally bad idea, likely to lead to many tears, much gnashing of
teeth, and quite a few rampages with sharp objects, not to mention a lot
of code guaranteed to fail at the edge cases.

If, as a programmer, you *really* want to run with scissors then convert
your string to a binary byte buffer and go from there. At least then when
you poke out an eye you won't be nearly so surprised.
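
To make the edge cases concrete, here is a rough Python sketch that measures one string in three units. It assumes UTF-8 for the byte count, and it approximates a grapheme as a base code point plus any combining marks that follow it, which is much cruder than real Unicode grapheme segmentation. The name `count_units` is made up for the example:

```python
import unicodedata


def count_units(text):
    """Return (bytes, code points, approximate graphemes) for text."""
    n_bytes = len(text.encode("utf-8"))
    n_codes = len(text)
    # Approximation: every non-combining code point starts a new grapheme.
    n_graphs = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return n_bytes, n_codes, n_graphs


# "resume" with both accents written as combining marks (U+0301).
word = "re\u0301sume\u0301"
print(count_units(word))  # (10, 8, 6)
```

The same offset therefore names three different positions depending on the unit, and a byte offset of 3 here falls inside the first combining accent.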

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk



Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Austin Hastings
--- Dan Sugalski [EMAIL PROTECTED] wrote:
 On Mon, 28 Jun 2004, Juerd wrote:
 
  Dave Whipp skribis 2004-06-28  9:55 (-0700):
substr($string, 2 bytes, 4 bytes) = $substitute;
   substr($string, 2, 4 :bytes)
 
  substr($string, 2 but graphemes, 4 but bytes);
 
  I think but even makes sense, if substr defaults to something.
 
 I think mixing strings, bytes, graphemes, and code points together 
 is a phenomenally bad idea, likely to lead to many tears, much
 gnashing of teeth, and quite a few rampages with sharp objects,
 not to mention a lot of code guaranteed to fail at the edge cases.

Hmm. Suppose that I have a system that is friendly to 80 byte records.
I want to output meaningful strings, so I want to partition a buffer
into 80-ish byte substrings, but preserve any graphemes (i.e., store
the data in a legible format).

How would I do that?

The obvious answer is a gnarly little loop, but I think I'd like to
have perl do that for me. Can I say something like:

  while ($buffer)
  {
    $output = substr($buffer, 0, 80 but bytes, units => graphemes);
    $buffer = substr($buffer, length $output :graphemes);

    $cout << $output << nl; # :-)
  }

and get some dwimmery?

=Austin
 
 If, as a programmer, you *really* want to run with scissors then
 convert your string to a binary byte buffer and go from there. At
 least then when you poke out an eye you won't be nearly so surprised.
 
   Dan
 
 
 



Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Austin Hastings wrote:

 --- Dan Sugalski [EMAIL PROTECTED] wrote:
  On Mon, 28 Jun 2004, Juerd wrote:
 
   Dave Whipp skribis 2004-06-28  9:55 (-0700):
 substr($string, 2 bytes, 4 bytes) = $substitute;
substr($string, 2, 4 :bytes)
  
   substr($string, 2 but graphemes, 4 but bytes);
  
   I think but even makes sense, if substr defaults to something.
 
  I think mixing strings, bytes, graphemes, and code points together
  is a phenomenally bad idea, likely to lead to many tears, much
  gnashing of teeth, and quite a few rampages with sharp objects,
  not to mention a lot of code guaranteed to fail at the edge cases.

 Hmm. Suppose that I have a system that is friendly to 80 byte records.
 I want to output meaningful strings, so I want to partition a buffer
 into 80-ish byte substrings, but preserve any graphemes (i.e., store
 the data in a legible format).

 How would I do that?

You don't. Or if you do, you do it with a lot of pain, sweat, and annoying
hard work. 80 bytes gets you somewhere between three and 80 graphemes (and
three may be a *high* estimate for the lower bound--there may be
circumstances where 80 bytes is insufficient for *one* grapheme).

This isn't something that can be made generically easy.
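
For the record, the gnarly loop might look roughly like the Python sketch below. It is not a general solution: it assumes UTF-8 records, it approximates a grapheme as a base code point plus trailing combining marks, and the names `graphemes` and `records` are invented for the example. It also has to give up when a single grapheme does not fit in one record:

```python
import unicodedata


def graphemes(text):
    """Yield approximate graphemes: a base code point plus combining marks."""
    cluster = ""
    for ch in text:
        # A non-combining code point closes the cluster collected so far.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


def records(text, limit=80):
    """Split text into UTF-8 records of at most `limit` bytes each,
    never cutting through one of the approximate graphemes."""
    record = b""
    for g in graphemes(text):
        encoded = g.encode("utf-8")
        if len(encoded) > limit:
            raise ValueError("a single grapheme exceeds the record size")
        if len(record) + len(encoded) > limit:
            yield record
            record = b""
        record += encoded
    if record:
        yield record


text = "e\u0301" * 30  # 30 graphemes of 3 bytes each, 90 bytes in total
print([len(r) for r in records(text)])  # [78, 12]
```

The records come out shorter than 80 bytes whenever the next grapheme would straddle the boundary, so "80-ish" is the best this approach can promise.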

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk



Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Austin Hastings
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote:
 Larry Wall [EMAIL PROTECTED] writes:
 
  (I've been trying to make it assume some implicit unit based on the
  current lexical scope's Unicode level, but issues remain.)  We have
  magical string positions that have different numeric values
  depending on what units you view them as, but at what point does a
  number like 5 get translated to such a magical string position?
 
 It would be possible to have right-associative operators (that bind
 at least more tightly than comma and possibly very tightly) and
 convert a number to one of these objects, so that we can do stuff 
 like this:
 
 substr($string, 2 bytes, 4 bytes) = $substitute;
 
 Then if you pass a plain number to substr it could either assume
 something (possibly generating a warning) or spit an error, depending
 on some feature of the current lexical scope.

A couple of alternatives:

  substr.bytes($string, 2, 4) = $substitute;

  substr($string.bytes, 2, 4) = $substitute;

  # Make it a pragma
  use String(bytes); 
  substr($string, 2, 4) = $substitute;

  # Make it a global mode
  set_string_mode(bytes);
  substr($string, 2, 4) = $substitute;

  # Make it an object mode
  $string.access_mode(bytes);
  substr($string, 2, 4) = $substitute;

 The word bytes is clearly much too long, though, much less
 graphemes or codepoints.  I thought about this:
 
 substr($string, 2b, 4b) = $substitute;

Problems with:
 
  substr($string, 0b, 1b) = $substitute;

Is that binary or bytes? Also:

  substr($string, $start b, $end b) = $substitute;

Looks unintuitive.

 With presumably g and c for graphemes and codepoints, but I rather
 suspect that might conflict with some other existing syntax (though I
 can't think of anything in particular).

0c? 0x16c ?

 And I can't think of another abbreviation that would be remotely
 intuitive.
 
 There's also the possibility of bsubstr and so on, but that leads us
 down the path of C, having a hillion bajillion functions with names
 like fgets, stoi, and fstrnclost.  Having sprintf is quite enough of
 that, IMO.
 
  I dunno--it reads pretty well.  Maybe these'll be heavily enough
  used that we should Huffmanize them down a bit:
 
  $str.bytes
  $str.codes
  $str.graphs
  $str.letters
 
 codes and graphs is better than codepoints and graphemes, at least.

In certain (IMO large) sectors of the Perl community, string processing
is just about all the work there is. I submit that there needs to be a
way to drive the token length to 0: either a pragma, or a global mode,
or a type definition.
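
As a rough illustration of the "global mode" flavour of this, here is a Python sketch in which the unit is set once and then omitted from every call. Everything in it is hypothetical (`string_unit`, `length`, the unit names), and unlike a real pragma it is dynamically scoped rather than lexically scoped:

```python
from contextlib import contextmanager

_default_unit = "codes"  # hypothetical module-wide default unit


@contextmanager
def string_unit(unit):
    """Temporarily change the default unit, restoring it on exit."""
    global _default_unit
    previous, _default_unit = _default_unit, unit
    try:
        yield
    finally:
        _default_unit = previous


def length(text, unit=None):
    """Length of text in the given unit, or in the current default."""
    unit = unit or _default_unit
    if unit == "bytes":
        return len(text.encode("utf-8"))
    if unit == "codes":
        return len(text)
    raise ValueError(f"unknown unit: {unit!r}")


cafe = "caf\u00e9"
print(length(cafe))            # default unit: code points
with string_unit("bytes"):
    print(length(cafe))        # same call, now counted in UTF-8 bytes
```

The call sites carry no unit token at all, at the cost that a reader has to know which mode is in force to know what a number means.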

 
  Though letters is a bit inadequate to describe language-dependent
  graphemes, since it also divides any non-letters...I suppose we
  could go with .characters if we don't mind forcing a heavily
  overloaded word in one particular direction, culturally speaking.
  Except, I'd kinda like to keep them starting with different
  letters.
  (And maybe .chars should be reserved to mean whatever the default
  unit is in the current lexical scope, as with substr() above.)
   
 You could coin the abbreviation ligs, for Language Independent
 Graphemes.  Then some ingenious rascal can create a pragma or
 whatever that allows $str.b, $str.c, $str.g, and $str.l for 
 fans of terseness.

As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

To me, the right thing to do is provide a 'default' way to work, and
allow for changing that default to some other way. The obvious defaults
are 'bytes', which gives C-like behavior (unpopular though that may
presently be) and imposes little or no conceptual strain but likewise
no enormous benefit, and 'graphemes'.

I like graphemes for the default because I hate and fear graphemes. The
whole *code thing just crawls right in my ear, so having the language
transparently support it would be a win. Having the language force me
to understand this stuff, if it cannot be transparently supported,
would also be a win, on a longer time scale.

=Austin