A friend of mine got bitten by an expectations bug. He was using
re.findall to look for all occurrences of strings matching a pattern, and a
substring he *knew* was in there did not pop out.
The bug was that it overlapped another matching substring, and findall
only returns non-overlapping matches. This is documented; he just missed
it.
But he asked me: is there a standard method to get even overlapping
matches?
Cut to its basics, here's an artificial example:
>>> import re
>>> rexp=re.compile("B.B")
>>> sequence="BABBEBIB"
>>> rexp.findall(sequence)
['BAB', 'BEB']
What he wanted was the list ['BAB', 'BEB', 'BIB']; but since
the last 'B' in "BEB" is also the first 'B' in "BIB", "BIB" is not picked
up.
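To make the overlap visible, here is a small sketch (my own illustration, reusing the rexp and sequence names from above) that prints where each match starts and ends. As far as I know, finditer scans the same non-overlapping way findall does, so the positions show why "BIB" gets skipped:

```python
import re

rexp = re.compile("B.B")
sequence = "BABBEBIB"

# Record (start, end, text) for each non-overlapping match.
spans = [(m.start(), m.end(), m.group()) for m in rexp.finditer(sequence)]
print(spans)

# "BIB" occupies sequence[5:8], but scanning resumes at index 6,
# the end of "BEB", so its leading 'B' at index 5 is already consumed.
print(sequence[5:8])
```

The second match ends at index 6, one past the 'B' that "BIB" would need to start on.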
After looking through the docs, I couldn't find a way to do this with
the standard methods, so I gave him a quick roll-your-own solution:
>>> def myfindall(regex, seq):
...     resultlist = []
...     pos = 0
...     # Restart each search one character past the previous match's
...     # start, so a later match may overlap an earlier one.
...     while True:
...         result = regex.search(seq, pos)
...         if result is None:
...             break
...         resultlist.append(seq[result.start():result.end()])
...         pos = result.start() + 1
...     return resultlist
...
>>> myfindall(rexp,sequence)
['BAB', 'BEB', 'BIB']
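One workaround that appears to give the same result is to wrap the pattern in a lookahead with a capturing group. This is only a sketch: it is a regex trick rather than a dedicated option of findall, and the name overlapping is just for illustration. Because the lookahead itself matches zero characters, the scan advances one position at a time, and findall returns what the group captured:

```python
import re

sequence = "BABBEBIB"

# (?=(B.B)) matches the empty string wherever "B.B" starts; the inner
# group captures the text, and findall returns the group's contents.
overlapping = re.findall(r"(?=(B.B))", sequence)
print(overlapping)  # ['BAB', 'BEB', 'BIB']
```

A caveat: if the original pattern already contains groups of its own, findall returns tuples instead of plain strings, so the loop above may still be the simpler choice in that case.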
But just curious; are we reinventing the wheel here? Is there already a
way to match even overlapping substrings? I'm surprised I can't find one.
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor