Author: larry
Date: Sat Apr 22 11:24:56 2006
New Revision: 8910

Modified:
   doc/trunk/design/syn/S02.pod
   doc/trunk/design/syn/S03.pod
   doc/trunk/design/syn/S05.pod

Log:
Fixes from Daniel and Markus.
Various clarifications on string positions vs Str and Buf types.
Broke down and added the <at($pos)> assertion.


Modified: doc/trunk/design/syn/S02.pod
==============================================================================
--- doc/trunk/design/syn/S02.pod        (original)
+++ doc/trunk/design/syn/S02.pod        Sat Apr 22 11:24:56 2006
@@ -12,9 +12,9 @@
 
   Maintainer: Larry Wall <[EMAIL PROTECTED]>
   Date: 10 Aug 2004
-  Last Modified: 21 Apr 2006
+  Last Modified: 22 Apr 2006
   Number: 2
-  Version: 28
+  Version: 29
 
 This document summarizes Apocalypse 2, which covers small-scale
 lexical items and typological issues.  (These Synopses also contain
@@ -393,8 +393,35 @@
 
 =item *
 
-A C<Str> is a Unicode string object.  (There is no corresponding
-native C<str> type.)  A C<Buf> is a stringish view of an array of
+A C<Str> is a Unicode string object.  There is no corresponding native
+C<str> type.  However, since a C<Str> object may fill multiple roles,
+we say that a C<Str> keeps track of its minimum and maximum Unicode
+abstraction levels, and plays along nicely with the current lexical
+scope's idea of the ideal character, whether that is bytes, codepoints,
+graphemes, or characters in some language.  For all builtin operations,
+all C<Str> positions are reported as position objects, not integers.
+These C<StrPos> objects point into a particular string at a particular
+location independent of abstraction level.  The subtraction of two
+C<StrPos> objects gives a C<StrLen> object, which is still not an
+integer, because the string between two positions also has multiple
+integer interpretations depending on the units.  A given C<StrLen>
+may know that it represents 18 bytes, 7 codepoints, and 3 graphemes,
+but it knows this lazily because it actually just hangs onto the two
+C<StrPos> objects.  (It's much like a C<Range> object in that respect.)
+
+If you use integers as arguments where position objects are expected,
+it will be assumed that you mean the units of the current lexically
+scoped Unicode abstraction level.  (Which defaults to graphemes.)
+Otherwise you'll need to coerce to the proper units:
+
+    substr($string, 42.as(Bytes), 1.as(ArabicChars))
+
+Of course, such a dimensional number will fail if used on a string
+that doesn't provide the appropriate abstraction level.
+
+=item *
+
+A C<Buf> is a stringish view of an array of
 integers, and has no Unicode or character properties without explicit
 conversion to some kind of C<Str>.  (A C<buf> is the native counterpart.)
 Typically it's an array of bytes serving as a buffer.  Bitwise
@@ -407,6 +434,17 @@
 appropriate C<Buf> interface), but when used to create a buffer C<Buf>
 defaults to C<buf8>.
 
+Unlike C<Str> types, C<Buf> types prefer to deal with integer string
+positions, and map these directly to the underlying compact array
+as indices.  That is, these are not necessarily byte positions--an
+integer position just counts over the number of underlying positions,
+where one position means one cell of the underlying integer type.
+Builtin string operations on C<Buf> types return integers and expect
+integers when dealing with positions.  As a limiting case, C<buf8> is
+just an old-school byte string, and the positions are byte positions.
+Note, though, that if you remap a section of C<buf32> memory to be
+C<buf8>, you'll have to multiply all your positions by 4.
+
 =back
 
 =head1 Names and Variables

Modified: doc/trunk/design/syn/S03.pod
==============================================================================
--- doc/trunk/design/syn/S03.pod        (original)
+++ doc/trunk/design/syn/S03.pod        Sat Apr 22 11:24:56 2006
@@ -328,7 +328,7 @@
 for operators like C<< < >> that don't return the same type as they
 take, so these kinds of operators overload the single-argument case
 to return something more meaningful.  All the comparison operators
-return a boolean for either 1 or 0 arguments.  Negated operators,
+return a boolean for either 1 or 0 arguments.  Negated operators
 return C<Bool::False>, and all the rest return C<Bool::True>.
 
 This metaoperator can also be used on the semicolon second-dimension
@@ -496,6 +496,7 @@
     my $foo            # ordinary lexically scoped variable
     our $foo           # lexically scoped alias to package variable
     has $foo           # object attribute
+    env $foo           # environmental lexical
     state $foo         # persistent lexical (cloned with closures)
     constant $foo      # lexically scoped compile-time constant
 
@@ -538,7 +539,7 @@
 
 Note that C<temp> and C<let> are I<not> variable declarators, because
 their effects only take place at runtime.  Therefore, they take an ordinary
-lvalue object as their arguments.  See S04 for more details.
+lvalue object as their argument.  See S04 for more details.
 
 There are a number of other declarators that are not variable
 declarators.  These include both type declarators:

Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod        (original)
+++ doc/trunk/design/syn/S05.pod        Sat Apr 22 11:24:56 2006
@@ -14,9 +14,9 @@
    Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
                Larry Wall <[EMAIL PROTECTED]>
    Date: 24 Jun 2002
-   Last Modified: 21 Apr 2006
+   Last Modified: 22 Apr 2006
    Number: 5
-   Version: 20
+   Version: 21
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them I<regex> because they haven't been
@@ -550,7 +550,9 @@
 
 =item *
 
-If the value is a number, the key is rematched ignoring any keys
+If the value is a number, this entry represents a "false match".
+The match position is set back to before the current false match, and the
+key is rematched using the same hash, but this time ignoring any keys
 longer than the number.  (This is measured in the default Unicode
 level in effect where the hash was declared, usually graphemes. If
 the current Unicode level is lower, the results are as if the string
@@ -651,16 +653,18 @@
 
 =item *
 
-A leading C<%> matches like a bare hash except that each value is
+A leading C<%> matches like a bare hash except that a string value is
 always treated as a subrule, even if it is a string that must be compiled
-to a regex at match time.
+to a regex at match time.  (Numeric values may still indicate "false match".
+and a closure may do whatever it likes.)
 
 By default C<< <%foo> >> is captured into C<< $<foo> >>, but you can
 use the C<< <?%foo> >> form to suppress capture, and you can always say
 C<< $<%foo> := <%foo> >> if you prefer to include the sigil in the key.
 
-With both bare hash and hash in angles, the key is always skipped
-over before calling any subrule in the value.  That subrule may, however,
+With both bare hash and hash in angles, the key is counted as "matched"
+immediately; that is, the current match position is set to C<after> the key
+token before calling any subrule in the value.  That subrule may, however,
 magically access the key anyway as if the subrule had started before the
 key and matched with C<< <KEY> >> assertion.  That is, C<< $<KEY> >>
 will contain the keyword or token that this subrule was looked up under,
@@ -857,10 +861,29 @@
 
 =item *
 
-The C<\G> sequence is gone.  Use C<:p> instead.  (Note, however, that
-it makes no sense to use C<:p> within a pattern, since every internal
-pattern is implicitly anchored to the current position.  You'll have
-to explicitly compare C<< <( .pos == $oldpos )> >> in that case.)
+The C<\G> sequence is gone.  Use C<:p> instead.  (Note, however,
+that it makes no sense to use C<:p> within a pattern, since every
+internal pattern is implicitly anchored to the current position.)
+To anchor to a particular position in the general case you can use
+the C<< <at($pos)> >> assertion to say that the current position
+is the same as the position object you supply.  Please remember
+that in Perl 6 string positions are generally I<not> integers, but
+objects that point to a particular place in the string regardless
+of whether you count by bytes or codepoints or graphemes.  If used
+with an integer, the C<at> assertion will assume you mean the current
+lexically scoped Unicode level, on the assumption that this integer was
+somehow generated in this same lexical scope.  If this is outside the
+current string's allowed abstraction levels, an exception is thrown.
+See S02 for more discussion of string positions.
+
+C<Buf> types are based on fixed-width cells and can therefore
+handle integer positions just fine, and treat them as array indices.
+In particular, C<buf8> AKA C<buf> is just an old-school byte string.
+However, as with matching on any C<Array> type, if you do matching on
+a C<buf32> and end up in the middle of a 32-bit cell, you'll still
+get an opaque C<StrPos> that remembers both the element offset and
+the position within that element.  If you force it to give you a
+numeric offset, I wouldn't blame it for giving you a fractional value.
 
 =item *
 
