On Mon, Aug 30, 2010 at 10:52 PM, Sam M <sm9...@gmail.com> wrote:
> Hi Guys,
>
> I'd like remove contents between tags <email> that matches pattern "WORD1"
> as follows:
>
> Change
> "stuff <email>word1-emai...@domain.com</email> more stuff
> <email>word1-emai...@domain.com</email> still more stuff
> <email>word2-emai...@domain.com</email> stuff after WORD2
> <email>word1-emai...@domain.com</email>"
>
> To
> "stuff more stuff still more stuff <email>word2-emai...@domain.com</email>
> stuff after WORD2 "
>
> The following did not work
> newl = re.sub (r'<email>WORD1-.*</email>',"",line)
>

This precise problem is actually described in the re documentation on python.org:

http://docs.python.org/howto/regex.html#greedy-versus-non-greedy

In short: .* is greedy and gobbles up as much as it can. That means </email> will match the last </email> tag in the line, and all the previous ones are simply eaten by .*

To solve this, we have the non-greedy quantifiers. They match not as much as possible, but as little as possible. To make a quantifier non-greedy, simply add a question mark at its end:

    r'<email>WORD1-.*?</email>'

Hugo
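
P.S. To make the difference concrete, here is a minimal sketch comparing the two patterns. The example.com addresses are made-up placeholders standing in for the ones in the question. The re.IGNORECASE flag is my assumption: the pattern says WORD1 while the sample text says word1, so without it nothing would match at all.

```python
import re

# Sample line modelled on the question; the addresses are made-up placeholders.
line = ("stuff <email>word1-a@example.com</email> more stuff "
        "<email>word1-b@example.com</email> still more stuff "
        "<email>word2-c@example.com</email> stuff after WORD2 "
        "<email>word1-d@example.com</email>")

# Greedy: .* runs from the first matching <email> to the LAST </email>,
# so everything in between (including the word2 address) is removed.
# re.IGNORECASE is an assumption, needed because the data is lowercase.
greedy = re.sub(r'<email>WORD1-.*</email>', '', line, flags=re.IGNORECASE)

# Non-greedy: .*? stops at the FIRST </email> after each match start,
# so each word1 address is removed on its own.
non_greedy = re.sub(r'<email>WORD1-.*?</email>', '', line, flags=re.IGNORECASE)

print(repr(greedy))
print(repr(non_greedy))
```

Note that the non-greedy version leaves the spaces on both sides of each removed tag, so you end up with double spaces where an address used to be. If that matters, clean the whitespace up in a separate step.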