Thomas wrote:
> I have the following mostly working function to strip the first 4
> digit year out of some text. But a leading space confounds it for
> years starting 20..:
>
> import re
> def getyear(text):
>     s = """(?:.*?(19\d\d)|(20\d\d).*?)"""
>     p = re.compile(s,re.IGNORECASE|re.DOTALL) #|re.VERBOSE
>     y = p.match(text)
>     try:
>         return y.group(1) or y.group(2)
>     except:
>         return ''
>
> >>> getyear('2002')
> '2002'
> >>> getyear(' 2002')
> ''
> >>> getyear(' 1902')
> '1902'
>
> A regex of ".*?" means any number of any characters, with a non-greedy
> hunger (so to speak) right?
>
> Any ideas on what is causing this to fail?

The | character has very low precedence in a regex, so it splits your whole pattern in two. You are matching either
- any number of characters followed by 19xx, or
- 20xx followed by any number of characters

Since match() only matches at the start of the string, the second alternative fails when anything precedes the 20xx.

You could use this instead:
.*?(?:(19\d\d)|(20\d\d)).*?

But why not use p.search(), which will find the string anywhere without needing the wildcards? Then your regex could be just
19\d\d|20\d\d
and you return just y.group()

Kent
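
For reference, here is a minimal sketch of both suggestions above. The names YEAR_RE, GROUPED_RE and getyear_match are made up for illustration and are not from the original post; the None check stands in for the original try/except.

```python
import re

# Hypothetical names, for illustration only.
# Suggestion 2: let search() scan the string, so no wildcards are needed.
YEAR_RE = re.compile(r"19\d\d|20\d\d")

def getyear(text):
    """Return the first 19xx or 20xx year in text, or '' if there is none."""
    m = YEAR_RE.search(text)   # search() looks anywhere; match() only at the start
    return m.group() if m else ""

# Suggestion 1: keep match(), but group the alternation so the leading .*?
# applies to both the 19xx and the 20xx branch.
GROUPED_RE = re.compile(r".*?(?:(19\d\d)|(20\d\d))", re.DOTALL)

def getyear_match(text):
    """Same result as getyear(), using match() with the grouped pattern."""
    m = GROUPED_RE.match(text)
    return (m.group(1) or m.group(2)) if m else ""
```

With either version, getyear(' 2002') should return '2002' rather than ''. The trailing .*? is dropped in the sketch because a non-greedy wildcard at the end of a pattern matches nothing and so has no effect.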