Author: larry
Date: Mon Jul 7 21:30:08 2008
New Revision: 14557

Modified:
   doc/trunk/design/syn/S05.pod

Log:
Clarify the role of whitespace within transliterations
Power up transliterations with regexes and closures
Formally define the implied alternation as equivalent to longest-token matching

Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Mon Jul 7 21:30:08 2008
@@ -14,9 +14,9 @@
   Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
               Larry Wall <[EMAIL PROTECTED]>
   Date: 24 Jun 2002
-  Last Modified: 21 Jun 2008
+  Last Modified: 7 Jul 2008
   Number: 5
-  Version: 82
+  Version: 83
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax. We now try to call them I<regex> rather than "regular
@@ -3661,12 +3661,25 @@
 
     $str.=trans( 'A'=>'a', 'B'=>'b', 'C'=>'c' );
 
+Whitespace characters are taken literally as characters to be
+translated from or to. The C<..> range sequence is the only metasyntax
+recognized within a string, though you may of course use backslash
+interpolations in double quotes. If the right side is too short, the
+final character is replicated out to the length of the left string.
+If there is no final character because the right side is the null
+string, the result is deletion instead.
+
 =item *
 
-The two sides of each pair may also be Array objects:
+Either or both sides of the pair may also be Array objects:
 
     $str.=trans( ['A'..'C'] => ['a'..'c'], <X Y Z> => <x y z> );
 
+The array version is the underlying primitive form: the semantics of
+the string form is exactly equivalent to first doing C<..> expansion
+and then splitting the string into individual characters and then
+using that as an array.
+
 =item *
 
 The array version can map one-or-more characters to one-or-more
@@ -3675,11 +3688,36 @@
     $str.=trans( [' ',      '<',    '>',    '&'    ] =>
                  ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);
 
-
 In the case that more than one sequence of input characters matches,
 the longest one wins.
 In the case of two identical sequences the first in order wins.
 
+=item *
+
+The recognition done by the string and array forms is very basic.
+To achieve greater power, any recognition element of the left side
+may be specified by a regex that can do character classes, lookahead,
+etc.
+
+
+    $str.=trans( [/ \h /,   '<',    '>',    '&'    ] =>
+                 ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);
+
+    $str.=trans( / \s+ /, ' ' );  # squash all whitespace to one space
+
+These submatches are mixed into the overall match in exactly the same way that
+they are mixed into parallel alternation in ordinary regex processing, so
+longest token rules apply across all the possible matches specified to the
+transliteration operator. Once a match is made and transliterated, the parallel
+matching resumes at the new position following the end of the previous match,
+even if it matched multiple characters.
+
+=item *
+
+If the right side of the arrow is a closure, it is evaluated to
+determine the replacement value. If the left side was matched by a
+regex, the resulting match object is available within the closure.
+
 =back
 
 =head1 Substitution
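
The string and array rules added in this revision can be modeled outside
Perl 6. What follows is a minimal Python sketch under stated assumptions,
not the Perl 6 implementation: the names expand and trans are hypothetical,
regex keys and closure replacements are left out, and it assumes that the
"replicate the final character" rule carries over to the array form because
the text defines the string form in terms of arrays.

```python
def expand(spec):
    """Expand 'A..C' style ranges, then split into single characters.

    Whitespace is kept literally; '..' is the only metasyntax handled.
    """
    out = []
    i = 0
    while i < len(spec):
        # A range needs a start character, '..', and an end character.
        if spec[i + 1:i + 3] == ".." and i + 3 < len(spec):
            lo, hi = ord(spec[i]), ord(spec[i + 3])
            out.extend(chr(c) for c in range(lo, hi + 1))
            i += 4
        else:
            out.append(spec[i])
            i += 1
    return out


def trans(text, pairs):
    """Model of the string/array forms; pairs is a list of (left, right).

    Each side is a string (expanded via expand) or a list of strings.
    """
    table = []  # (key, replacement) in declaration order
    for left, right in pairs:
        keys = expand(left) if isinstance(left, str) else list(left)
        vals = expand(right) if isinstance(right, str) else list(right)
        for n, key in enumerate(keys):
            if not vals:
                repl = ""  # null right side: deletion
            else:
                # Right side too short: replicate its final element.
                repl = vals[min(n, len(vals) - 1)]
            table.append((key, repl))

    out = []
    pos = 0
    while pos < len(text):
        best = None
        for key, repl in table:
            if key and text.startswith(key, pos):
                # Strictly longer wins, so the first of equal keys is kept.
                if best is None or len(key) > len(best[0]):
                    best = (key, repl)
        if best is None:
            out.append(text[pos])  # no key matches: copy through
            pos += 1
        else:
            out.append(best[1])
            pos += len(best[0])  # resume after the whole match
    return "".join(out)
```

Under those assumptions, trans("ABCD", [("A..C", "a..c")]) gives "abcD",
trans("abc", [("a..c", "x")]) gives "xxx", and
trans("abab", [(["a", "ab"], ["1", "2"])]) gives "22", since the longer key
"ab" beats "a" at each position.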