Re: Transliteration preferring longest match

2005-12-16 Thread Ruud H.G. van Tol
John Macdonald:

 [trans]
 If a shorter rule is allowed to match first, then the longer
 rule can be removed from the match set, at least for constant
 string matches.

It is not about the length of the rules, but about the length of the
matches.

If both \s+ and \h+ match the same length, should then \h+ be honored
because it is more specific?

And are we only talking about matches at the same position? (Stepping
through the input-buffer character-by-character, and testing each
pattern.)


 If, for example, '=' can match without
 preferring to try first for '==' then you'll never match '=='
 without syntactic help to force a backtracking retry.

If rules will match in order of appearance, it is to the user to put the
rules in the right order.

Some help can be provided, like a warning when an 'ab' precedes an
'abc', and maybe even when an 'a*' precedes an 'a+'.

-- 
Grtz, Ruud



Re: Transliteration preferring longest match

2005-12-16 Thread Larry Wall
On Fri, Dec 16, 2005 at 01:29:11PM +0100, Ruud H.G. van Tol wrote:
: John Macdonald:
: 
:  [trans]
:  If a shorter rule is allowed to match first, then the longer
:  rule can be removed from the match set, at least for constant
:  string matches.
: 
: It is not about the length of the rules, but about the length of the
: matches.

They're the same thing when only fixed strings are allowed.  We've only
been calling the rules for lack of a better term, but they aren't
rules, or even regexes.  Transliteration is literal.

: If both \s+ and \h+ match the same length, should then \h+ be honored
: because it is more specific?

If you want ordered matches of real rules, you should use /@array/
in a real match.  The pattern equivalent to tr/// involves /%hash/
instead somehow.

: And are we only talking about matches at the same position? (Stepping
: through the input-buffer character-by-character, and testing each
: pattern.)

Both s/// and tr/// have ways of skipping non-matching characters, but
they're not the same.

It would be a useful exercise to write tr/// in terms of s///.
It occurs to me that it'd be awfully useful to have a kind of hash
that returns any unmatched key unchanged.  But there's actually a
subtle conflict between how you want to use the hash on the left
and on the right.  They have the same keys but different values.

s/(%match)/%replace{$0}/

The way we've got hashes defined currently on the left, the lookup finds
an additional rule to continue parsing, on the assumption that the key
of %match is merely the first keyword of some longer construct.  But the
value can't simultaneously be a subsequent rule *and* a replacement value,
so we end up looking up the same string twice in two different hashes.
(Even if the first one is actually doing a trie internally or some such,
it's still effectively a hash lookup, and why do it twice?.)  Maybe there's
some way to write the rule in the value of %match that matches zero
width but returns a useful value so it'd be something more like:

s/(%match)/$0[0]/

:  If, for example, '=' can match without
:  preferring to try first for '==' then you'll never match '=='
:  without syntactic help to force a backtracking retry.
: 
: If rules will match in order of appearance, it is to the user to put the
: rules in the right order.
: 
: Some help can be provided, like a warning when an 'ab' precedes an
: 'abc', and maybe even when an 'a*' precedes an 'a+'.

Yes, that would be useful for /@array/ analysis.  But tr/// ain't that.

Larry


Re: Transliteration preferring longest match

2005-12-16 Thread Larry Wall
On Fri, Dec 16, 2005 at 09:14:52AM -0800, Larry Wall wrote:
: It would be a useful exercise to write tr/// in terms of s///.
: It occurs to me that it'd be awfully useful to have a kind of hash
: that returns any unmatched key unchanged.

Actually, in this case it's handled by the fact that the null key
always matches if nothing longer matches.  So tr/abc/ABC/ turns into

$matcha = /{$ := 'A'}/
$matchb = /{$ := 'B'}/
$matchc = /{$ := 'C'}/
$match  = /./;
s/%match/$/g;

or some such.

Larry


Re: Transliteration preferring longest match

2005-12-16 Thread Ruud H.G. van Tol
Larry Wall:
 Ruud H.G. van Tol:
 John Macdonald:

 [trans]
 If a shorter rule is allowed to match first, then the longer
 rule can be removed from the match set, at least for constant
 string matches.

 It is not about the length of the rules, but about the length of the
 matches.

 They're the same thing when only fixed strings are allowed.  We've
 only been calling the[m] rules for lack of a better term, but they
 aren't rules, or even regexes.  Transliteration is literal.

Yes, I let rules carry me away. Maybe with a little help from
misreading the phrase In the case that more than one sequence of input
characters matches (S05). And of using PHP's preg_replace() too.

-- 
Grtz, Ruud



Re: Transliteration preferring longest match

2005-12-16 Thread Brad Bowman


This is only about transliteration (tr///), not rules in general.
So you are only matching a fixing set of strings at a certain position.
If one string is the prefix of another, the longer is preferred.
If there are two identical match strings, the replacement corresponding
to the first is used.

My original post was confused and confusing.

Brad

--
Among the words spoken by great generals, there are some that were said
offhandedly. One should not receive these words in the same manner, however.
   -- Hagakure http://bereft.net/hagakure/


Re: Transliteration preferring longest match

2005-12-15 Thread Luke Palmer
On 12/15/05, Brad Bowman [EMAIL PROTECTED] wrote:
 Why does the longest input sequence win?
Is it for some consistency that that I'm not seeing? Some exceedingly
 common use case?  The rule seems unnecessarily restrictive.

Hmm.  Good point.  You see, the longest token wins because that's an
exceedingly common rule in lexers, and you can't sort regular
expressions the way you can sort strings, so there needs to be special
machinery in there.

There are two rather weak arguments to keep the longest token rule:

* We could compile the transliteration into a DFA and make it
fast.  Premature optimization.
* We could generalize transliteration to work on rules as well.

In fact, I think the first Perl module I ever wrote was
Regexp::Subst::Parallel, which did precisely the second of these. 
That's one of the easy things that was hard in Perl (but I guess
that's what CPAN is for).  Hmm.. none of these is really a compelling
argument either way.

Luke


Re: Transliteration preferring longest match

2005-12-15 Thread Larry Wall
On Thu, Dec 15, 2005 at 06:50:19PM +0100, Brad Bowman wrote:
: 
: Hi,
: 
: S05 describes an array version of trans for transliteration:
:  ( http://dev.perl.org/perl6/doc/design/syn/S05.html#Transliteration )
: 
:   The array version can map one-or-more characters to one-or-more 
:   characters:
: 
:  $str.=trans( [' ',  '','',''] =
:   ['nbsp;', 'lt;', 'gt;', 'amp;' ]);
: 
:   In the case that more than one sequence of input characters matches,
:   the longest one wins. In the case of two identical sequences the first
:   in order wins.
: 
: Why does the longest input sequence win? 
:   Is it for some consistency that that I'm not seeing? Some exceedingly
: common use case?  The rule seems unnecessarily restrictive.

On the contrary, it frees the user from having to worry about the order.

: The first in order rule is more flexible, the user can sort their
: arrays to produce the longest input rule, or use another order if that is
: preferred.

What possible use is a user-ordered rule set?  If you put the shorter
entry first, the longer one can never be reached.  It's not like you
can backtrack into a transliteration and pick a different entry.

: The first transliteration example even uses sort in
: the pair-wise form:
: 
:   $str.trans( %mapping.pairs.sort );

That seems like a useless use of sort, and probably defeats the optimizer
as well.

: Can we drop the longest preference?

Doesn't hurt anything, and can probably help.  Plus we already have
the longest token rule in there for magical hash matching in rules,
so it's likely the optimizer will already know how to handle it,
or something like it.

Larry


Re: Transliteration preferring longest match

2005-12-15 Thread John Macdonald
On Thu, Dec 15, 2005 at 09:56:09PM +, Luke Palmer wrote:
 On 12/15/05, Brad Bowman [EMAIL PROTECTED] wrote:
  Why does the longest input sequence win?
 Is it for some consistency that that I'm not seeing? Some exceedingly
  common use case?  The rule seems unnecessarily restrictive.
 
 Hmm.  Good point.  You see, the longest token wins because that's an
 exceedingly common rule in lexers, and you can't sort regular
 expressions the way you can sort strings, so there needs to be special
 machinery in there.
 
 There are two rather weak arguments to keep the longest token rule:
 
 * We could compile the transliteration into a DFA and make it
 fast.  Premature optimization.
 * We could generalize transliteration to work on rules as well.
 
 In fact, I think the first Perl module I ever wrote was
 Regexp::Subst::Parallel, which did precisely the second of these. 
 That's one of the easy things that was hard in Perl (but I guess
 that's what CPAN is for).  Hmm.. none of these is really a compelling
 argument either way.

If a shorter rule is allowed to match first, then the longer
rule can be removed from the match set, at least for constant
string matches.  If, for example, '=' can match without
preferring to try first for '==' then you'll never match '=='
without syntactic help to force a backtracking retry.

--