[svn:perl6-synopsis] r14484 - doc/trunk/design/syn

2008-01-10 Thread larry
Author: larry
Date: Thu Jan 10 11:51:20 2008
New Revision: 14484

Modified:
   doc/trunk/design/syn/S05.pod

Log:
Added :ii and :bb substitution modifiers as suggested by ruoso++


Modified: doc/trunk/design/syn/S05.pod
==
--- doc/trunk/design/syn/S05.pod  (original)
+++ doc/trunk/design/syn/S05.pod  Thu Jan 10 11:51:20 2008
@@ -14,9 +14,9 @@
Maintainer: Patrick Michaud [EMAIL PROTECTED] and
Larry Wall [EMAIL PROTECTED]
Date: 24 Jun 2002
-   Last Modified: 28 Dec 2007
+   Last Modified: 10 Jan 2008
Number: 5
-   Version: 69
+   Version: 70
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them I<regex> rather than regular
@@ -190,6 +190,12 @@
 ignored in its lexical scope, but not in its dynamic scope.  That is,
 subrules always use their own case settings.
 
+The C<:ii> variant may be used on a substitution to change the
+substituted string to the same case pattern as the matched string.
+Case info is carried across on a character by character basis.  If
+the right string is longer than the left one, the case of the final
+character is replicated.
+
 =item *
 
 The C<:b> (or C<:basechar>) modifier scopes exactly like C<:ignorecase>
@@ -202,6 +208,12 @@
 includes all ignored characters, including any that follow the final
 base character.
 
+The C<:bb> variant may be used on a substitution to change the
+substituted string to the same accent pattern as the matched string.
+Accent info is carried across on a character by character basis.  If
+the right string is longer than the left one, the remaining characters
+are substituted without any modification.
+
 =item *
 
 The C<:c> (or C<:continue>) modifier causes the pattern to continue
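
The :bb rule added in this revision can be modelled outside Perl 6. The
following is only a sketch in Python, under stated assumptions: "accent"
is taken to mean the Unicode combining marks left over after canonical
decomposition, a "character" is approximated as a base codepoint plus its
trailing marks, and the names clusters and same_marks are hypothetical
helpers, not anything defined by S05.

```python
import unicodedata


def clusters(text):
    """Split text into [base, marks] pairs using canonical decomposition."""
    units = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and units:
            units[-1][1] += ch          # attach the mark to the preceding base
        else:
            units.append([ch, ""])      # start a new base character
    return units


def same_marks(replacement, matched):
    """Give `replacement` the accent pattern of `matched`, position by position."""
    model = clusters(matched)
    out = []
    for i, (base, marks) in enumerate(clusters(replacement)):
        if i < len(model):
            out.append(base + model[i][1])   # keep the base, take marks from the match
        else:
            out.append(base + marks)         # past the end of the match: unmodified
    # Assumption: the result is returned in NFC for convenience.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, same_marks("uber", "über") yields "über", and a replacement
longer than the match keeps its trailing characters as they were. Letters
such as "ø" that have no canonical decomposition are treated as plain
base characters by this sketch.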


[svn:perl6-synopsis] r14485 - doc/trunk/design/syn

2008-01-10 Thread larry
Author: larry
Date: Thu Jan 10 12:28:57 2008
New Revision: 14485

Modified:
   doc/trunk/design/syn/S05.pod

Log:
Some clarifications suggested by moritz++


Modified: doc/trunk/design/syn/S05.pod
==
--- doc/trunk/design/syn/S05.pod  (original)
+++ doc/trunk/design/syn/S05.pod  Thu Jan 10 12:28:57 2008
@@ -192,9 +192,15 @@
 
 The C<:ii> variant may be used on a substitution to change the
 substituted string to the same case pattern as the matched string.
-Case info is carried across on a character by character basis.  If
-the right string is longer than the left one, the case of the final
-character is replicated.
+Case info is carried across on a character by character basis.  If the
+right string is longer than the left one, the case of the final
+character is replicated.  Titlecase is carried across if possible
+regardless of whether the resulting letter is at the beginning of
+a word or not; if there is no titlecase character available, the
+corresponding uppercase character is used.  (This policy can be
+modified within a lexical scope by a language-dependent Unicode
+declaration to substitute titlecase according to the orthographic
+rules of the specified language.)
 
 =item *
 
@@ -212,7 +218,8 @@
 substituted string to the same accent pattern as the matched string.
 Accent info is carried across on a character by character basis.  If
 the right string is longer than the left one, the remaining characters
-are substituted without any modification.
+are substituted without any modification.  (Note that NFD/NFC distinctions
+are usually immaterial, since Perl encapsulates that in grapheme mode.)
 
 =item *
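
The :ii rule as clarified in this revision can be sketched in Python.
This is an illustration under assumptions, not the S05 implementation:
it walks Python codepoints rather than graphemes, it leaves a character
unchanged when the corresponding matched character has no case, and the
name same_case is hypothetical. Python's str.title() on a single
character applies the Unicode titlecase mapping, which coincides with
the uppercase mapping for letters that have no distinct titlecase form.

```python
import unicodedata


def same_case(replacement, matched):
    """Recase `replacement` to follow the case pattern of `matched`."""
    if not matched:
        return replacement               # nothing to copy a pattern from
    out = []
    for i, ch in enumerate(replacement):
        # Past the end of the match, the case of its final character is replicated.
        model = matched[min(i, len(matched) - 1)]
        if unicodedata.category(model) == "Lt":
            out.append(ch.title())       # titlecase, or uppercase if none exists
        elif model.isupper():
            out.append(ch.upper())
        elif model.islower():
            out.append(ch.lower())
        else:
            out.append(ch)               # caseless model (digit, space): unchanged
    return "".join(out)
```

With this sketch, same_case("world", "HeLLo") gives "WoRLd", and
same_case("goodbye", "Hi") gives "Goodbye" because the lowercase "i" is
replicated over the remaining letters.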
 


[svn:perl6-synopsis] r14487 - doc/trunk/design/syn

2008-01-10 Thread larry
Author: larry
Date: Thu Jan 10 13:16:27 2008
New Revision: 14487

Modified:
   doc/trunk/design/syn/S02.pod

Log:
typo


Modified: doc/trunk/design/syn/S02.pod
==
--- doc/trunk/design/syn/S02.pod  (original)
+++ doc/trunk/design/syn/S02.pod  Thu Jan 10 13:16:27 2008
@@ -720,7 +720,7 @@
 id should be assigned that uniquely identifies the grapheme.
 If such identifiers are assigned consistently thoughout the process,
 comparison of two graphemes is no more difficult than the comparison
-of two integers, and comparison of base characters no more different
+of two integers, and comparison of base characters no more difficult
 than a direct lookup into the id-to-NFD table.
 
 Obviously, any temporary grapheme ids must be translated back to


[svn:perl6-synopsis] r14486 - doc/trunk/design/syn

2008-01-10 Thread larry
Author: larry
Date: Thu Jan 10 13:05:42 2008
New Revision: 14486

Modified:
   doc/trunk/design/syn/S02.pod

Log:
Added some random thoughts about performance implications of grapheme view


Modified: doc/trunk/design/syn/S02.pod
==
--- doc/trunk/design/syn/S02.pod  (original)
+++ doc/trunk/design/syn/S02.pod  Thu Jan 10 13:05:42 2008
@@ -12,9 +12,9 @@
 
   Maintainer: Larry Wall [EMAIL PROTECTED]
   Date: 10 Aug 2004
-  Last Modified: 5 Jan 2008
+  Last Modified: 10 Jan 2008
   Number: 2
-  Version: 124
+  Version: 125
 
 This document summarizes Apocalypse 2, which covers small-scale
 lexical items and typological issues.  (These Synopses also contain
@@ -706,6 +706,41 @@
 erroneous to pass such a non-dimensional number to a routine that
 would interpret it with the wrong units.
 
+Implementation note: since Perl 6 mandates that the default Unicode
+processing level must view graphemes as the fundamental unit rather
+than codepoints, this has some implications regarding efficient
+implementation.  It is suggested that all graphames be translated on
+input to a unique grapheme numbers and represented as integers within
+some kind of uniform array for fast substr access.  For those graphemes
+that have a precomposed form, use of that codepoint is suggested.
+(Note that this means Latin-1 can still be represented internally
+with 8-bit integers.)
+
+For graphemes that have no precomposed form, a temporary private
+id should be assigned that uniquely identifies the grapheme.
+If such identifiers are assigned consistently thoughout the process,
+comparison of two graphemes is no more difficult than the comparison
+of two integers, and comparison of base characters no more different
+than a direct lookup into the id-to-NFD table.
+
+Obviously, any temporary grapheme ids must be translated back to
+some universal form (such as NFD) on output, and normal precomposed
+graphemes may turn into either NFC or NFD forms depending on the
+desired output.  Maintaining a particular grapheme/id mapping over the
+life of the process may have some GC implications for long-running
+processes, but most processes will likely see a limited number of
+non-precomposed graphemes.
+
+If the program has a scope that wants a codepoint view rather than
+a grapheme view, the string visible to that lexical scope must also
+be translated to universal form, just as with output translation.
+Alternately, the temporary grapheme ids may be hidden behind an
+abstraction layer.  In any case, codepoint scope should never see
+any temporary grapheme ids.  (The lexical codepoint declaration
+should probably specify which normalization form it prefers to
+view strings under.  Such a declaration could be applied to input
+translation as well.)
+
 =item *
 
 A C<Buf> is a stringish view of an array of
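
The implementation note added in this revision (integers per grapheme,
precomposed codepoints where they exist, temporary private ids
otherwise, translation back to a universal form on output) can be
sketched as follows. This is a Python model under assumptions, not a
description of any Perl 6 implementation: the class name GraphemeTable
and its methods are hypothetical, temporary ids are arbitrarily
allocated above the Unicode codepoint range, and grapheme boundaries are
approximated as a base codepoint plus its trailing combining marks.

```python
import unicodedata


class GraphemeTable:
    FIRST_TEMP_ID = 0x110000    # assumption: just above the last Unicode codepoint

    def __init__(self):
        self._id_of = {}        # NFD string -> temporary id
        self._nfd_of = {}       # temporary id -> NFD string

    def intern(self, cluster):
        """Return the integer id for one grapheme."""
        nfc = unicodedata.normalize("NFC", cluster)
        if len(nfc) == 1:
            return ord(nfc)     # a precomposed form exists: use its codepoint
        nfd = unicodedata.normalize("NFD", cluster)
        if nfd not in self._id_of:
            new_id = self.FIRST_TEMP_ID + len(self._id_of)
            self._id_of[nfd] = new_id
            self._nfd_of[new_id] = nfd
        return self._id_of[nfd]

    def nfd(self, gid):
        """The id-to-NFD lookup."""
        if gid in self._nfd_of:
            return self._nfd_of[gid]
        return unicodedata.normalize("NFD", chr(gid))

    def base(self, gid):
        """Base character of a grapheme: first codepoint of its NFD form."""
        return self.nfd(gid)[0]

    def encode(self, text):
        """Translate input text into a list of grapheme ids."""
        units = []
        for ch in text:
            if unicodedata.combining(ch) and units:
                units[-1] += ch
            else:
                units.append(ch)
        return [self.intern(u) for u in units]

    def decode(self, ids, form="NFC"):
        """Translate ids back to a universal form for output."""
        return unicodedata.normalize(form, "".join(self.nfd(g) for g in ids))
```

Because the same grapheme always interns to the same integer within one
table, comparing two graphemes is an integer comparison, and comparing
base characters is a lookup through nfd(). The table only grows in this
sketch, which is the GC concern the note raises for long-running
processes.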


[svn:perl6-synopsis] r14488 - doc/trunk/design/syn

2008-01-10 Thread larry
Author: larry
Date: Thu Jan 10 13:57:41 2008
New Revision: 14488

Modified:
   doc/trunk/design/syn/S02.pod

Log:
Another typo, grr


Modified: doc/trunk/design/syn/S02.pod
==
--- doc/trunk/design/syn/S02.pod  (original)
+++ doc/trunk/design/syn/S02.pod  Thu Jan 10 13:57:41 2008
@@ -709,7 +709,7 @@
 Implementation note: since Perl 6 mandates that the default Unicode
 processing level must view graphemes as the fundamental unit rather
 than codepoints, this has some implications regarding efficient
-implementation.  It is suggested that all graphames be translated on
+implementation.  It is suggested that all graphemes be translated on
 input to a unique grapheme numbers and represented as integers within
 some kind of uniform array for fast substr access.  For those graphemes
 that have a precomposed form, use of that codepoint is suggested.


[svn:perl6-synopsis] r14489 - doc/trunk/design/syn

2008-01-10 Thread larry
Author: larry
Date: Thu Jan 10 16:14:53 2008
New Revision: 14489

Modified:
   doc/trunk/design/syn/S02.pod

Log:
Clarification requested by moritz++


Modified: doc/trunk/design/syn/S02.pod
==
--- doc/trunk/design/syn/S02.pod  (original)
+++ doc/trunk/design/syn/S02.pod  Thu Jan 10 16:14:53 2008
@@ -548,6 +548,10 @@
 also ask for the total string length of an array's elements, in bytes,
 codepoints or graphemes, using these methods C<.bytes>, C<.codes> or C<.graphs>
 respectively on the array.  The same methods apply to strings as well.
+(Note that C<.bytes> is not guaranteed to be well-defined when the encoding
+is unknown.  Similarly, C<.codes> is not well-defined unless you know which
+canonicalization is in effect.  Hence, both methods allow an optional argument
+to specify the meaning exactly if it cannot be known from context.)
 
 There is no C<.length> method for either arrays or strings, because C<length>
 does not specify a unit.
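
The point of this clarification, that a byte count depends on the
encoding and a codepoint count depends on the normalization, can be
checked with a short Python sketch. The function names and keyword
arguments below are hypothetical stand-ins chosen for illustration, not
the Perl 6 signatures, and the grapheme count is an approximation that
simply skips combining marks.

```python
import unicodedata


def codes(text, form="NFC"):
    """Number of codepoints under a given normalization form."""
    return len(unicodedata.normalize(form, text))


def byte_count(text, form="NFC", enc="utf-8"):
    """Number of bytes under a given normalization form and encoding."""
    return len(unicodedata.normalize(form, text).encode(enc))


def graphs(text):
    """Approximate grapheme count: base characters only, marks not counted."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if not unicodedata.combining(ch))


s = "caf\u00e9"
print(graphs(s))                                  # 4 graphemes in either form
print(codes(s, "NFC"), codes(s, "NFD"))           # 4 versus 5 codepoints
print(byte_count(s, "NFC", "utf-8"),              # 5 bytes
      byte_count(s, "NFD", "utf-8"),              # 6 bytes
      byte_count(s, "NFC", "utf-16-le"))          # 8 bytes
```

The same four-grapheme string therefore has no single answer for its
length in bytes or codepoints until both parameters are fixed, which is
why the methods take an optional argument.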


Re: [svn:perl6-synopsis] r14489 - doc/trunk/design/syn

2008-01-10 Thread Juerd Waalboer
[EMAIL PROTECTED] skribis 2008-01-10 16:14 (-0800):
> +(Note that C<.bytes> is not guaranteed to be well-defined when the encoding
> +is unknown.

(This message is a mess; in my defense, it's 5:30 AM here. I just had to
respond, because I have the feeling Perl 6's unicode model is going
exactly the wrong way if I interpret this diff correctly.)

What if the encoding is known, but by accessing the .bytes level one
breaks the consistency?

Rather than a scheme where unicode text strings have an encoding
property, I think a scheme where unicode text strings are just unicode
text strings is better: the *binary* strings can have an encoding
property.

So:

* A Str is a sequence of codepoints, and provides grapheme/glyphs if
  requested. It does not have bytes, and the internal encoding does not
  show except through introspection. The internal encoding can
  theoretically change at Perl's will.
* A Buf is a sequence of bytes, not codepoints or characters of any
  kind.
* A Buf with a defined .encoding:
  - does Str, with transparent decoding (with validity checking)
  - also, transparent encoding

my Str $foo = "H€łłø wöŕłđ";
my Buf $bar;
$bar.encoding = "utf-8";  # or however a decoding is declared
$bar = $foo;  # gets utf-8 encoded
$bar.bytes;   # [ H, \xE2, \x82, \xAC, ... ]
$bar.codes;   # [ H, €, ł, ... ]
$foo.codes eqv $bar.codes  # true
$foo.bytes;   # Huh? What? Makes no sense - fail

All byte-oriented mechanisms can have an encoding defined somehow:
filehandles, environment variables, Bufs, system call wrappers.

I think that would work much easier than giving Strs encoding
properties. When writing to a file, or a Buf, you're probably using a
lot of Strs, and it would make no sense to have them all encode
differently. Likewise, when reading from IO, a Buf, or anything
byte-oriented, the encoding will have to be known to decode it.

I fail to see how giving a Str any .bytes or .encoding might make sense:
these belong to byte strings, not text strings.

Making it easy to work with the internal byte buffer will take away
means of optimization, ease of changing our mind about the best
implementation encoding, and either security or performance (Either you
check the consistency every time you do something and everything is
slow, or you don't, and everything is potentially insecure when passed
on to other code.) Of course, the current internal encoding and byte
buffer should be accessible somehow, and maybe even writable for the
brave of heart, but IMO certainly not with the way too encouraging
.bytes thing - I'm tempted to call for .HOW.internal.

I think that a Buf with a defined encoding should check whether the data
is valid when reading, but a Str can skip this: Perl itself put the data
there, and trusts its own routines for much better performance.

Please, don't give Strs any byte semantics, but do give Bufs support for
transparent en-/decoding, and perhaps even unicode semantics.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  [EMAIL PROTECTED]  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy [EMAIL PROTECTED]
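
The model proposed in the message above (text strings without byte
semantics, byte buffers that carry an encoding and decode with validity
checking) can be sketched in Python, whose own str and bytes types
happen to be split along similar lines. The Buf class below is a
hypothetical illustration of the proposal only; its method names are
invented here and it is not the Perl 6 Buf type.

```python
class Buf:
    """A byte sequence with an optional declared encoding."""

    def __init__(self, data=b"", encoding=None):
        self.data = bytes(data)
        self.encoding = encoding

    def byte_values(self):
        """The raw bytes, always available."""
        return list(self.data)

    def codes(self):
        """Decode transparently; invalid data raises UnicodeDecodeError."""
        if self.encoding is None:
            raise TypeError("Buf has no declared encoding")
        return list(self.data.decode(self.encoding))

    def assign(self, text):
        """Store a text string, encoding it with the declared encoding."""
        if self.encoding is None:
            raise TypeError("Buf has no declared encoding")
        self.data = text.encode(self.encoding)


bar = Buf(encoding="utf-8")
bar.assign("H\u20ac")
print(bar.byte_values())   # [72, 226, 130, 172]
print(bar.codes())         # ['H', '€']
```

Validation happens only at the Buf boundary in this sketch; once data
is a text string it is trusted, which matches the performance argument
made in the message.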


Re: [svn:perl6-synopsis] r14489 - doc/trunk/design/syn

2008-01-10 Thread Larry Wall
It's really already very much like you want it to be.  Most Str objects
do not in fact have any byte semantics.  If you say "foo".bytes, that
is shorthand for "foo".bytes(:nfc, :enc<UTF-8>).  In other words,
you have to tell it what units you want the bytes to be measured in.
It just assumes utf-8 as a convenient default.  Likewise a Str does
not have any codepoint semantics unless you tell it the normalization
to assume.  Most strings are sequences of abstract graphemes; see also
http://www.nntp.perl.org/group/perl.perl6.language/2008/01/msg28281.html
as well as the recent definitions of .bytes, .codes, .graphs, and .chars
in http://svn.pugscode.org/pugs/docs/Perl6/Spec/Functions.pod .

We do still talk about the possibility of multi-level strings,
but that's basically the same as your object that presents both
Str and Buf interfaces.  That's an exception rather than the rule,
and certainly as you say it would need to be well typed as to its
encoding and normalization.  The same considerations apply
between grapheme and codepoint views of the same string, except
there only the normalization is needed, since codepoints are
above the encoding abstraction level.

Larry


Re: [svn:perl6-synopsis] r14489 - doc/trunk/design/syn

2008-01-10 Thread Darren Duncan

At 11:09 PM -0800 1/10/08, Larry Wall wrote:

> It's really already very much like you want it to be.  Most Str objects
> do not in fact have any byte semantics.  If you say "foo".bytes, that
> is shorthand for "foo".bytes(:nfc, :enc<UTF-8>).  In other words,
> you have to tell it what units you want the bytes to be measured in.
> It just assumes utf-8 as a convenient default.  Likewise a Str does
> not have any codepoint semantics unless you tell it the normalization
> to assume.


Oh, that's good then.

Until now my interpretation of the Perl 6 situation is that while Str 
objects were conceptually grapheme strings, which .graphs refers to, 
you could access the currently in-use implementation details of that 
object using .codes and .bytes et al.  Timtoady (user choice of 
abstraction level) and all that.


As such, in my own Muldis D language design, which is heavily 
influenced by Perl 6, and has its character strings as 
highest-possible-abstraction unicode (generally graphemes), I made a 
point that all character string operations were more implementation 
agnostic, hence rather than 'graphs' or 'codes' there are 
'nfc_graphs' or 'nfd_codes' etc.


I'm glad to see, from your latest post, that this is how Perl 6 
actually works as well.  That .codes specifically works in terms of a 
particular normal-form (either a specified one or a default one) 
rather than the current implementation, and so makes this aspect of 
Perl 6 a lot more deterministic while portable.


-- Darren Duncan