Hello,

I am writing a script to convert from one font set (used to write in 
devanagari) to another. This consists of converting certain key sequences to 
others, e.g. Of --> k. To do the conversion process, I've looked at some of the 
single-pass multireplacement suggestions posted in the Python Cookbook which 
build a large regex string with find patterns separated with "|", i.e. a large 
string of alternations (and then they use a dictionary containing the values of 
the keys that were put in the regex string). 

In the process, I found something unexpected about regex patterns that use 
alternation: the match is not greedy. The result depends on which pattern is 
listed first.

###
>>> import re
>>> pat=re.compile('this|th')
>>> s='thistle'
>>> pat.match(s).group(0)
'this'
>>> pat=re.compile('th|this')
>>> pat.match(s).group(0)
'th'
###


I read the regex docs about the 'match' function and they say that the 
'leftmost non-overlapping occurrences of pattern in string' will be returned. I 
can understand why the pattern 'light|sli' (or 'sli|light') will match 'sli' in 
text 'slight' (because 'sli' is more to the left than 'light'). But why isn't 
the greedy result of 'this' obtained for either of the patterns given in the 
example above (between the ###)? Is this because it is trying to return a 
pattern that overlaps as little as possible with anything else? 

Basically what this is meaning for me so far is that a single-pass solution to 
replacing certain text sequences with others cannot always be done unless the 
patterns to be found are non-overlapping themselves. In the case of 'light|sli' 
even though 'light' appears first in the alternation, 'sli' will be found with 
higher priority in sequences that had 'slight' in them since 'sli' is more to 
the left.

Any comments on the non-greediness of the alternation pattern or on the 
multi-replacement process in general are appreciated.

/c
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

Reply via email to