Re: character classes in p6 rules

2005-05-12 Thread Larry Wall
On Wed, May 11, 2005 at 08:00:20PM -0500, Patrick R. Michaud wrote:
: Somehow I'd like to get rid of those inner angles, so 
: that we always use <+alpha>, <+digit>, <-sp>, <-punct> to 
: indicate named character classes, and specify combinations 
: with constructions like <+alpha+punct-[aeiou]> and <+word-[_]>.  
: We'd still allow <[abc]> as a shortcut to <+[abc]>.

I like it.

: I haven't thought far ahead to the question of whether
: character classes would continue to occupy the same namespace
: as rules (as they do now) or if they become specialized kinds
: of rules or what.  I'll just leave it at this for now and
: see what the rest of p6l thinks.

Hmm, well, positive matches can be defined to traverse whatever the
longest sequence matched is, even if it's actually multiple characters
by some reckoning or other.  On the other hand, negative matches
can really only skip one character in the current view regardless of
how long the sequences in the class are, which function as a negative
lookahead for the subsequent character skip.  In other words, <-alpha>
really means something like [<!alpha> .]
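
As a rough illustration of the point above, the sketch below spells a negated class two ways: directly, and as a negative lookahead followed by a one-character skip. It is written in Python rather than p6 rule syntax, the `re` module stands in for the rule engine, and "alpha" is approximated by the ASCII class [A-Za-z]; all of those are assumptions made for the example, and the helper name `first_match` is hypothetical.

```python
import re

# Assumption: Python's re module stands in for the p6 rule engine,
# and "alpha" is approximated by ASCII letters only.
ALPHA = r"[A-Za-z]"

# The negated class spelled directly.
direct = re.compile(r"[^A-Za-z]")

# The same idea spelled as "fail if alpha matches here, then skip
# exactly one character". DOTALL lets "." consume a newline too,
# so both patterns accept the same set of single characters.
lookahead_then_skip = re.compile(rf"(?!{ALPHA}).", re.DOTALL)


def first_match(pattern, text):
    """Return (offset, matched text) of the first match, or None."""
    m = pattern.search(text)
    return None if m is None else (m.start(), m.group())


for sample in ["abc-def", "abc", "a\nb"]:
    print(repr(sample), first_match(direct, sample),
          first_match(lookahead_then_skip, sample))
```

Both spellings consume exactly one character however long the alternatives inside the lookahead might be, which is the asymmetry with positive matches described above.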

But then it's not entirely clear how character class set theory works.
Another thing we have to work out.  Obviously + and - are ordered,
and we probably want & and | for actual set operations.  But does
<-[a]> negate only a preceding 'a' or all characters that use 'a'
as the base character along with subsequent combining characters?
We're almost getting into a wildcarding situation there...

In any event, the take-home message here is that characters cannot
be assumed to be constant width any more.
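
To make the width problem concrete, here is a small sketch (in Python, using only the standard `unicodedata` module) of one visible character that is either one or two code points depending on normalization. The helper `base_characters` is a hypothetical name introduced for the example, and dropping combining marks after NFD is just one possible reading of "base character", not a statement of how p6 rules will behave.

```python
import unicodedata

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT


def base_characters(text):
    """Decompose canonically (NFD), then drop every combining mark."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.combining(ch) == 0)


# Same visible character, different widths in code points.
print(len(composed), len(decomposed))

# A naive code-point test for 'a' disagrees between the two forms...
print("a" in composed, "a" in decomposed)

# ...while a base-character view treats them alike.
print(base_characters(composed) == base_characters(decomposed))
```

Whether a class that mentions 'a' should see the first view or the second is exactly the open question raised above.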

I think this argues that character classes really are rules of a sort.

Larry


character classes in p6 rules

2005-05-11 Thread Patrick R. Michaud
I now have a basic implementation for enumerated character classes in 
the grammar engine (i.e., <[xyz]>, <-[xyz]>, <[x..z]>, and <-[x..z]>).

I didn't see it specified anywhere, but are the \d, \D, \s, \S, etc.
metacharacters still supposed to work inside of an enumerated character 
class, as they do in Perl 5?   Or in p6 do we always use
<+<digit>+[xyz]>, <-<digit>>, <+<sp>>, <-<sp>>, etc.?

(Yes, I know that normally the absence of any spec to the contrary
indicates that we're still using p5 semantics, but this one is worth 
verification for me.)

While I'm on the subject, let me just ramble a bit -- there are 
times when alpha, digit, upper, etc. give me a bad feeling 
-- they look a little too much like subrules to me, especially 
when looking at <+<alpha>> and the like.  I keep wondering about 
things like <+<ident>> and <-<expr>>.

And something like  rx / <alpha>* /  may generate a lot
of not-very-useful one-character captures into $/<alpha> , so that
we'll typically want to get in the habit of writing 

    rx / <?alpha>* /
    rx / <+<alpha>>* /

and then have the engine recognize when this occurs so it
can optimize to a much faster character class op rather than
a lot of calls to a separate subrule.

Plus, <+<alpha>> just looks plain ugly and unbalanced to me.  
Somehow I'd like to get rid of those inner angles, so 
that we always use <+alpha>, <+digit>, <-sp>, <-punct> to 
indicate named character classes, and specify combinations 
with constructions like <+alpha+punct-[aeiou]> and <+word-[_]>.  
We'd still allow <[abc]> as a shortcut to <+[abc]>.

To me this looks cleaner overall, makes it clear we're doing a
one-character non-capturing match, and may enable a few optimization
possibilities.  (I'm sure that with enough effort we can get 
equivalent optimizations out of the existing syntax, and we may
need them anyway in the long run, but this might simplify that a 
fair bit.)
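
One way to picture the proposed combinations is as set terms applied strictly left to right. The sketch below does that in Python for specs such as +alpha+punct-[aeiou] (written here without the outer angles). Everything in it is an assumption made for illustration: the names `build_class`, `NAMED` and `UNIVERSE` are hypothetical, the named classes are ASCII-only approximations, a leading minus is taken to start from the whole (ASCII) universe, and ranges like [x..z] are not handled. It is not the grammar engine's implementation.

```python
import re
import string

# Assumption: a finite ASCII universe keeps the example small.
UNIVERSE = set(map(chr, range(128)))

# Assumption: ASCII-only stand-ins for the named classes.
NAMED = {
    "alpha": set(string.ascii_letters),
    "digit": set(string.digits),
    "punct": set(string.punctuation),
    "sp": set(string.whitespace),
    "word": set(string.ascii_letters + string.digits + "_"),
}

# One term: a sign, then either an enumerated [..] list or a class name.
TERM = re.compile(r"([+-])(\[[^\]]*\]|[A-Za-z]+)")


def build_class(spec):
    """Evaluate a spec like '+alpha+punct-[aeiou]' left to right."""
    if spec[0] not in "+-":
        spec = "+" + spec          # [abc] is shorthand for +[abc]
    terms = TERM.findall(spec)
    # A leading '-' subtracts from everything; a leading '+' adds to nothing.
    result = set(UNIVERSE) if terms[0][0] == "-" else set()
    for sign, operand in terms:
        chars = set(operand[1:-1]) if operand.startswith("[") else NAMED[operand]
        result = result | chars if sign == "+" else result - chars
    return result


cls = build_class("+alpha+punct-[aeiou]")
print("b" in cls, "!" in cls, "a" in cls)
print("_" in build_class("+word-[_]"))
print(build_class("[abc]") == build_class("+[abc]"))
```

Because the terms are ordered, +alpha-[aeiou]+[a] and +alpha+[a]-[aeiou] give different sets in this sketch, which is one reason separate operators for true set operations might still be wanted.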

I haven't thought far ahead to the question of whether
character classes would continue to occupy the same namespace
as rules (as they do now) or if they become specialized kinds
of rules or what.  I'll just leave it at this for now and
see what the rest of p6l thinks.

Pm