Re: [Tutor] Help with re.sub()

Kent Johnson Thu, 16 Mar 2006 21:02:56 -0800

John Clark wrote:
> Hi,
> 
> I have a file that is a long list of records (roughly) in the format
> 
> [EMAIL PROTECTED]
> 
> So, for example:
> 
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> ....
> 
> What I would like to do is run a regular expression against this and
> wind up with:
> 
> [EMAIL PROTECTED]@[EMAIL PROTECTED]@data4
> [EMAIL PROTECTED]


Regular expressions aren't well suited to collapsing runs of repeated
records like this. OTOH itertools.groupby() is a good fit:

# This represents your original data
data = '''[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]'''.splitlines()

# Convert to a list of pairs of (id, data)
data = [ line.split('@') for line in data ]

from itertools import groupby
from operator import itemgetter

# groupby() groups adjacent items that share whatever key we specify,
# so records with the same id must already be next to each other
# itemgetter(0) will pull out just the first item (the id)
# groupby() yields pairs of (key, iterator over the items in that group)
for id, items in groupby(data, itemgetter(0)):
     print '[EMAIL PROTECTED]' % (id, '@'.join(item[1] for item in items))

I have a longer explanation of groupby() and itemgetter() here:
http://www.pycs.net/users/0000323/weblog/2005/12/06.html
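
For anyone on Python 3, here is a self-contained sketch of the same groupby() idea. The sample records are made up (the real data isn't visible in this thread), and merge_records is just an illustrative name, not anything from the standard library. It assumes each line looks like id@data and that lines with the same id are adjacent.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records in "id@data" form (assumed, not the poster's data)
lines = [
    "id1@data1",
    "id1@data2",
    "id1@data3",
    "id1@data4",
    "id2@data1",
]


def merge_records(lines):
    """Collapse adjacent lines that share an id into 'id@data1@data2...'."""
    # Split on the first '@' only, so the data part may itself contain '@'
    pairs = [line.split("@", 1) for line in lines]
    merged = []
    # groupby() only groups adjacent items, so equal ids must be contiguous
    for key, group in groupby(pairs, key=itemgetter(0)):
        merged.append("@".join([key] + [item[1] for item in group]))
    return merged


result = merge_records(lines)
for line in result:
    print(line)
```

If the ids are not already contiguous in the file, sorting the pairs by id before calling groupby() would be one way to handle it, at the cost of losing the original record order.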

> So, my questions are:
> (1) Is there any way to get a single regular expression to handle
> overlapping matches so that I get what I want in one call?

I doubt it, though I'd be happy to be proven wrong ;)

> (2) Is there any way (without comparing the before and after strings) to
> know if a re.sub(...) call did anything?

Use re.subn() instead; it returns a tuple of the new string and the
number of substitutions made.
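
A small sketch of how that looks in practice. The input strings and the pattern here are invented for illustration; only re.subn() itself comes from the standard library.

```python
import re

# Hypothetical input and pattern, chosen only to show the return value
text = "id1@data1@data2"

# re.subn() returns (new_string, number_of_substitutions_made)
new_text, count = re.subn(r"@", "#", text)
print(new_text, count)

# When nothing matches, the string comes back unchanged and the count is 0
unchanged, no_hits = re.subn(r"xyz", "#", text)
if no_hits == 0:
    print("no substitutions were made")
```

Checking the count avoids comparing the before and after strings.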

Kent

