Re: string encoding regex problem
Hi,

On 2014-08-16 09:01:57 +, Peter Otten said:

> Philipp Kraus wrote:
>
>> The code works till last week correctly, I don't change the pattern.
>
> Websites' contents and structure change sometimes.
>
>> My question is, can it be a problem with string encoding?
>
> Your regex is all-ascii. So an encoding problem is very unlikely.
>
>> found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", data)
>>
>> Did I mask the question mark and quotes correctly?
>
> Yes.
>
> A quick check...
>
> >>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
> >>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
> ['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">']
>
> ...reveals that the matching link has "/boost-docs/" in its title, so the site contents probably did change.

I have created a short script:

    #!/usr/bin/env python

    import re, urllib2

    def URLReader(url) :
        f = urllib2.urlopen(url)
        data = f.read()
        f.close()
        return data

    print re.match( "\<small\>.*\<\/small\>", URLReader("http://sourceforge.net/projects/boost/") )

Within the data the string "<small>boost_1_56_0.tar.gz</small>" should be matched, but I always get a None result from re.match, and re.search also returns None. I have tested the regex under http://regex101.com/ with the HTML code, and on the page the regex is matched.

Can you help me please to fix the problem? I don't understand why the match returns None.

Thanks

Phil

-- https://mail.python.org/mailman/listinfo/python-list
Re: string encoding regex problem
Philipp Kraus wrote:

> I have created a short script:
>
>     #!/usr/bin/env python
>
>     import re, urllib2
>
>     def URLReader(url) :
>         f = urllib2.urlopen(url)
>         data = f.read()
>         f.close()
>         return data
>
>     print re.match( "\<small\>.*\<\/small\>", URLReader("http://sourceforge.net/projects/boost/") )
>
> Within the data the string "<small>boost_1_56_0.tar.gz</small>" should be matched, but I always get a None result from re.match, and re.search also returns None.

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

As the string doesn't start with your regex, re.match() is clearly wrong, but re.search() works for me:

>>> import re, urllib2
>>> def URLReader(url) :
...     f = urllib2.urlopen(url)
...     data = f.read()
...     f.close()
...     return data
...
>>> data = URLReader("http://sourceforge.net/projects/boost/")
>>> re.search("\<small\>.*\<\/small\>", data)
<_sre.SRE_Match object at 0x7f282dd58718>
>>> _.group()
'<small>boost_1_56_pdf.7z</small>'

> I have tested the regex under http://regex101.com/ with the HTML code, and on the page the regex is matched.
>
> Can you help me please to fix the problem? I don't understand why the match returns None.
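
The difference between the two calls can be reproduced without any network access. The following is a minimal sketch for Python 3 (the thread itself uses Python 2); the `html` string is a made-up stand-in for the downloaded page, not the real SourceForge markup.

```python
import re

# Hypothetical stand-in for the page source; the real HTML is not shown in the thread.
html = '<html><body><small>boost_1_56_0.tar.gz</small></body></html>'
pattern = r'<small>.*?</small>'

# re.match only tries the pattern at position 0 of the string.
at_start = re.match(pattern, html)

# re.search tries every position until the pattern matches somewhere.
anywhere = re.search(pattern, html)

print(at_start)          # None, because html begins with "<html>"
print(anywhere.group())  # <small>boost_1_56_0.tar.gz</small>
```

If a pattern really must match from the first character, `re.match` is the right call; for picking a fragment out of a larger document, `re.search` is usually what is wanted.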
Re: string encoding regex problem
Philipp Kraus wrote:

> The code works till last week correctly, I don't change the pattern.

Websites' contents and structure change sometimes.

> My question is, can it be a problem with string encoding?

Your regex is all-ascii. So an encoding problem is very unlikely.

> found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", data)
>
> Did I mask the question mark and quotes correctly?

Yes.

A quick check...

>>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
>>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">']

...reveals that the matching link has "/boost-docs/" in its title, so the site contents probably did change.
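
The effect of the changed title can be checked offline against a small sample. This is only a sketch in Python 3: the `data` fragment is modelled on the findall() output quoted in the message above, the surrounding markup is assumed, and the `loose` pattern is an illustration rather than something proposed in the thread.

```python
import re

# Fragment modelled on the quoted findall() result; the <a ...> wrapper is an assumption.
data = ('<a href="/projects/boost/files/latest/download?source=files" '
        'title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">')

# The original pattern insists that the title starts with "/boost/".
strict = r'a href="/projects/boost/files/latest/download\?source=files" title="/boost/(.*)'

# A looser variant (illustrative only) that captures whatever the title holds.
loose = r'a href="/projects/boost/files/latest/download\?source=files" title="([^"]*)"'

strict_result = re.search(strict, data)
loose_result = re.search(loose, data)

print(strict_result)          # None: the title now starts with "/boost-docs/"
print(loose_result.group(1))
```

Testing a pattern against a saved copy of the page like this separates "the regex is wrong" from "the input changed", which is the distinction the replies in this thread keep returning to.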
string encoding regex problem
Hello,

I have defined a function with:

    def URLReader(url) :
        try :
            f = urllib2.urlopen(url)
            data = f.read()
            f.close()
        except Exception, e :
            raise MyError.StopError(e)
        return data

which gets the HTML source code from a URL. I use this to get a part of an HTML document without any HTML parsing, so I call (I would like to get the download link of the boost library):

    found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/") )
    if found == None :
        raise MyError.StopError("Boost Download URL not found")

But found is always None, so I cannot get the correct match. I didn't find the error in my code.

Thanks for help

Phil
Re: string encoding regex problem
In article lsm8ic$j90$1...@online.de, Philipp Kraus philipp.kr...@flashpixx.de wrote:

> found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/") )
> if found == None :
>     raise MyError.StopError("Boost Download URL not found")
>
> But found is always None, so I cannot get the correct match. I didn't find the error in my code.

I would start by breaking this down into pieces. Something like:

    data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
    print data
    found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", data)
    if found == None :
        raise MyError.StopError("Boost Download URL not found")

Now at least you get to look at what URLReader() returned. Did it return what you expected? If not, then there might be something wrong in your URLReader() function. If it is what you expected, then I would start looking at the pattern to see if it's correct. Either way, you've managed to halve the size of the problem.
Re: string encoding regex problem
On 2014-08-16 00:48:46 +, Roy Smith said:

> In article lsm8ic$j90$1...@online.de, Philipp Kraus philipp.kr...@flashpixx.de wrote:
>
>> found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/") )
>> if found == None :
>>     raise MyError.StopError("Boost Download URL not found")
>>
>> But found is always None, so I cannot get the correct match. I didn't find the error in my code.
>
> I would start by breaking this down into pieces. Something like:
>
>     data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
>     print data
>     found = re.search( "a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)", data)
>     if found == None :
>         raise MyError.StopError("Boost Download URL not found")
>
> Now at least you get to look at what URLReader() returned. Did it return what you expected? If not, then there might be something wrong in your URLReader() function.

I have checked the result (sorry, I forgot this information in my first post). The URLReader returns the HTML code of the URL, so this seems to work correctly.

> If it is what you expected, then I would start looking at the pattern to see if it's correct. Either way, you've managed to halve the size of the problem.

The code works till last week correctly, I don't change the pattern. My question is, can it be a problem with string encoding? Did I mask the question mark and quotes correctly?

Phil
Re: string encoding regex problem
In article lsmeej$49n$1...@online.de, Philipp Kraus philipp.kr...@flashpixx.de wrote:

> The code works till last week correctly, I don't change the pattern.

OK, so what did you change? Can you go back to last week's code and compare it to what you have now to see what changed?

> My question is, can it be a problem with string encoding? Did I mask the question mark and quotes correctly?

The best thing to do with regular expressions is to use raw strings, i.e. r'this is a string'. The nice thing about that is backslashes are not special. It makes it about 1000% easier to write complicated regular expressions. Simple ones are only 500% easier.
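
A short illustration of why raw strings help, as a Python 3 sketch; the sample strings are invented for the demonstration.

```python
import re

# To match one literal backslash, the regex needs two backslashes.
# In a plain literal each of those must itself be doubled; in a raw literal it need not.
plain = '\\\\section'
raw = r'\\section'
same_pattern = (plain == raw)

text = 'see \\section for details'    # the text contains a single backslash
hit = re.search(raw, text).group()
print(same_pattern, hit)              # True \section

# '\b' in a plain literal is a backspace character, not a word boundary.
without_raw = re.findall('\bfiles\b', 'source=files')
with_raw = re.findall(r'\bfiles\b', 'source=files')
print(without_raw, with_raw)          # [] ['files']
```

The second half shows the more dangerous case: the plain literal is silently accepted, but it asks the regex engine for something quite different from what was intended.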
Re: string encoding regex problem
Philipp Kraus wrote:

> The code works till last week correctly, I don't change the pattern. My question is, can it be a problem with string encoding? Did I mask the question mark and quotes correctly?

If you didn't change the code, how could the *exact same code* not mask the question mark last week, but this week suddenly start masking it, despite not changing?

There are three things that can cause a change in behaviour:

- the re module has changed;
- the pattern has changed;
- the text you are searching has changed.

Have you removed the re module and replaced it with a different one? Did you update Python to a new version?

Have you changed the regex search pattern?

Has the text you are searching changed? Websites upgrade their HTML quite frequently. Perhaps the Boost website has changed enough to break your regex.

-- Steven
Re: regex (?!..) problem
Stefan Behnel wrote:

> Wolfgang Rohdewald wrote:
>> I want to match a string only if a word (C1 in this example) appears at most once in it.
>
>     def match(s):
>         if s.count("C1") > 1:
>             return None
>         return s
>
> If this doesn't fit your requirements, you may want to provide some more details.

That will return a false value if s is the empty string. How about:

    def match(s):
        return s.count("C1") <= 1

-- HansM

-- http://mail.python.org/mailman/listinfo/python-list
Re: regex (?!..) problem
On Oct 4, 9:34 pm, Wolfgang Rohdewald wolfg...@rohdewald.de wrote:

> Hi,
>
> I want to match a string only if a word (C1 in this example) appears at most once in it. This is what I tried:
>
> >>> re.match(r'(.*?C1)((?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
> ('C1b1b1b1 b3b3b3b3 C1', '')
> >>> re.match(r'(.*?C1)','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
> ('C1',)
>
> but this should not have matched. Why is the .*? behaving greedy if followed by (?!.*C1)?

It's not.

> I would have expected that re first evaluates (.*?C1) before proceeding at all.

It does.

What you're not realizing is that if a regexp search comes to a dead end, it won't simply return "no match". Instead it'll throw away part of the match, backtrack to a previously-matched variable-length subexpression, such as ".*?", and try again with a different length.

That's what happened above. At first the group "(.*?C1)" non-greedily matched the substring "C1", but the rest of the pattern couldn't match under those circumstances, so it backtracked to the ".*?" and looked for a longer match, which it found.

Here's something to keep in mind: except for a few corner cases, greedy versus non-greedy will not affect the substring matched, it'll only affect the groups.

> I also tried:
>
> >>> re.search(r'(.*?C1(?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3C4').groups()
> ('C1b1b1b1 b3b3b3b3 C1',)
>
> with the same problem.
>
> How could this be done?

Can't be done with regexps.

How you would do this kind of depends on your overall goals, but your first look should be toward the string methods. If you share details with us we can help you choose a better strategy.

Carl Banks
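
The backtracking described in this message can be observed directly. A minimal Python 3 sketch, reusing the sample string from the thread:

```python
import re

text = 'C1b1b1b1 b3b3b3b3 C1C2C3'

# On its own, the lazy group stops at the first C1.
first = re.match(r'(.*?C1)', text).group(1)

# With the negative lookahead appended, stopping at the first C1 fails
# (another C1 follows), so the engine backtracks into .*? and lets it
# grow until the lookahead succeeds after the last C1.
extended = re.match(r'(.*?C1)(?!.*C1)', text).group(1)

print(first)     # C1
print(extended)  # C1b1b1b1 b3b3b3b3 C1
```

Lazy quantifiers only change the order in which lengths are tried; they do not stop the engine from trying longer ones when the rest of the pattern fails.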
Re: regex (?!..) problem
On Monday 05 October 2009, Carl Banks wrote:

> What you're not realizing is that if a regexp search comes to a dead end, it won't simply return "no match". Instead it'll throw away part of the match, backtrack to a previously-matched variable-length subexpression, such as ".*?", and try again with a different length.

Well, that explains it. This is contrary to what the documentation says, though. Should I file a bug report?
http://docs.python.org/library/re.html

Now back to my original problem: Would you have any idea how to solve it?

count() is no solution in my case, I need re.search to either return None or a match.

-- Wolfgang
Re: regex (?!..) problem
Wolfgang Rohdewald wrote:

> I want to match a string only if a word (C1 in this example) appears at most once in it.

    def match(s):
        if s.count("C1") > 1:
            return None
        return s

If this doesn't fit your requirements, you may want to provide some more details.

Stefan
Re: regex (?!..) problem
On Oct 4, 11:17 pm, Wolfgang Rohdewald wolfg...@rohdewald.de wrote:

> On Monday 05 October 2009, Carl Banks wrote:
>> What you're not realizing is that if a regexp search comes to a dead end, it won't simply return "no match". Instead it'll throw away part of the match, backtrack to a previously-matched variable-length subexpression, such as ".*?", and try again with a different length.
>
> Well, that explains it. This is contrary to what the documentation says, though. Should I file a bug report?
> http://docs.python.org/library/re.html

If you're referring to the section where it explains greedy qualifiers, it is not wrong per se. re.match does exactly what the documentation says: it matches as few characters as possible to the non-greedy pattern.

However, since it's easy to misconstrue that if you don't know about regexp backtracking, perhaps a little mention of backtracking is warranted. IMO it's not a documentation bug, so if you want to file a bug report I'd recommend filing it as a wishlist item.

I will mention that my followup contained an error (which you didn't quote). I said greedy versus non-greedy doesn't affect the substring matched. That was wrong: it does affect the substring matched; what it doesn't affect is whether there is a match found.

> Now back to my original problem: Would you have any idea how to solve it?
>
> count() is no solution in my case, I need re.search to either return None or a match.

Why do you have to use a regexp at all?

In Python we recommend using string operations and methods whenever reasonable, and avoiding regexps unless you specifically need their extra power. String operations can easily do the examples you posted, so I see no reason to use regexps.

Depending on what you want to do with the result, one of the following functions should be close to what you need. (I am using "word" to refer to the string being matched against, and "token" for the thing you don't want to appear more than once.)

    def token_appears_once(word, token):
        return word.count(token) == 1

    def parts(word, token):
        head, sep, tail = word.partition(token)
        # sep is empty when token is absent; a second token would be in tail
        if sep == "" or token in tail:
            return None
        return (head, sep, tail)

If you really need a match object, you should do a search, and then call the .count method on the matched substring to see if there is more than one occurrence, like this:

    def match_only_if_token_appears_once(pattern, word, token):
        m = re.search(pattern, word)
        # guard against no match at all before inspecting the matched text
        if m is not None and m.group(0).count(token) != 1:
            m = None
        return m

Carl Banks
Re: regex (?!..) problem
On Monday 05 October 2009, Stefan Behnel wrote:

> Wolfgang Rohdewald wrote:
>> I want to match a string only if a word (C1 in this example) appears at most once in it.
>
>     def match(s):
>         if s.count("C1") > 1:
>             return None
>         return s
>
> If this doesn't fit your requirements, you may want to provide some more details.

Well - the details are simple and already given: I need re.search to either return None or a match. But I will try to state it differently:

I have a string representing the results for a player of a board game (Mah Jongg - not the solitaire but the real one, played by 4 players), and I have a list of scoring rules. Those rules can be modified by the user, who can also add new rules. Mah Jongg is played with very different rulesets worldwide.

The rules are written as regular expressions. Since what they do varies greatly, I do not want to treat some of them in a special way. That would theoretically be possible but it would really complicate things.

For each rule I simply need to check whether it applies or not. I do that by calling re.search(rule, gamestring) and by checking the result against None.

Here you can look at all rules I currently have.
http://websvn.kde.org/trunk/playground/games/kmj/src/predefined.py?view=markup
The rule I want to rewrite is called "Robbing the Kong". Of course it is more complicated than my example with C1.

Here you can find the documentation for the gamestring:
http://websvn.kde.org/trunk/playground/games/doc/kmj/index.docbook?revision=1030476&view=markup
(get HTML files with "meinproc index.docbook")

-- Wolfgang
Re: regex (?!..) problem
On Monday 05 October 2009, MRAB wrote:

> "(?!.*?(C1).*?\1)" will succeed only if ".*?(C1).*?\1" has failed, in which case the group (group 1) will be undefined (no capture).

I see. I should have moved the (C1) out of this expression anyway:

>>> re.match(r'L(?P<tile>..)(?!.*?(?P=tile).*?(?P=tile))(.*? (?P=tile))','LC1 C1B1B1B1 b3b3b3b3 C2C2C3').groups()
('C1', ' C1')

this solves my problem, thank you!

-- Wolfgang
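
Stripped of the game-specific parts, the lookahead technique from this message gives a pattern that re.search can use to test "at most once". This is a Python 3 sketch; the helper name `appears_at_most_once` and the `^` anchor are additions for illustration, and it assumes single-line strings, since `.` does not match a newline by default.

```python
import re

# The ^ anchor matters with re.search: without it, the search could restart
# after the first C1, where the lookahead would succeed.
AT_MOST_ONCE = re.compile(r'^(?!.*C1.*C1)')

def appears_at_most_once(s):
    """Return a match object if 'C1' occurs zero or one times in s, else None."""
    return AT_MOST_ONCE.search(s)

none_at_all = appears_at_most_once('b1b1 b3b3')
exactly_one = appears_at_most_once('C1b1b1b1 b3b3b3b3 C2C3')
two_of_them = appears_at_most_once('C1b1b1b1 b3b3b3b3 C1C2C3')

print(bool(none_at_all), bool(exactly_one), bool(two_of_them))  # True True False
```

Because the lookahead is evaluated only at the start of the string, there is no variable-length part in front of it for the engine to backtrack into, which is what defeated the earlier attempts in this thread.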
regex (?!..) problem
Hi,

I want to match a string only if a word (C1 in this example) appears at most once in it. This is what I tried:

>>> re.match(r'(.*?C1)((?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
('C1b1b1b1 b3b3b3b3 C1', '')
>>> re.match(r'(.*?C1)','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
('C1',)

but this should not have matched. Why is the .*? behaving greedy if followed by (?!.*C1)? I would have expected that re first evaluates (.*?C1) before proceeding at all.

I also tried:

>>> re.search(r'(.*?C1(?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3C4').groups()
('C1b1b1b1 b3b3b3b3 C1',)

with the same problem.

How could this be done?

-- Wolfgang
Re: regex (?!..) problem
Why not check it simply by count()?

>>> s = '1234C156789'
>>> s.count('C1')
1
Re: regex problem ..
Analog Kid wrote:

> Hi guys:
> Thanks for your responses. Points taken. Basically, I am looking for a combination of the following ... [^\w] and %(?!20) ... How do I do this in a single RE?
>
> Thanks for all your help.
> Regards,
> AK
>
> On Mon, Dec 15, 2008 at 10:54 PM, Steve Holden st...@holdenweb.com wrote:
>
>> Analog Kid wrote:
>>> Hi All:
>>> I am new to regular expressions in general, and not just re in python. So, apologies if you find my question stupid :) I need some help with forming a regex. Here is my scenario ...
>>> I have strings coming in from a list, each of which I want to check against a regular expression and see whether or not it "qualifies". By that I mean I have a certain set of characters that are permissible and if the string has characters which are not permissible, I need to flag that string ... here is a snip ...
>>>
>>>     flagged = list()
>>>     strs = ['HELLO', 'Hi%20There', '123...@#@']
>>>     p = re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
>>>     for s in strs:
>>>         if len(p.findall(s)) > 0:
>>>             flagged.append(s)
>>>
>>>     print flagged
>>>
>>> my question is ... if I wanted to allow '%20' but not '%', how would my current regex (r"[^a-zA-Z0-9]") be modified?
>>
>> The essence of the approach is to observe that each element is a sequence of zero or more "character", where character is "either letter/digit or escape". So you would use a pattern like
>>
>>     "([a-zA-Z0-9]|%[0-9a-f][0-9a-f])+"

Did you *try* the above pattern?

regards
Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/
Re: regex problem ..
Hi guys:
Thanks for your responses. Points taken. Basically, I am looking for a combination of the following ... [^\w] and %(?!20) ... How do I do this in a single RE?

Thanks for all your help.
Regards,
AK

On Mon, Dec 15, 2008 at 10:54 PM, Steve Holden st...@holdenweb.com wrote:

> Analog Kid wrote:
>> Hi All:
>> I am new to regular expressions in general, and not just re in python. So, apologies if you find my question stupid :) I need some help with forming a regex. Here is my scenario ...
>> I have strings coming in from a list, each of which I want to check against a regular expression and see whether or not it "qualifies". By that I mean I have a certain set of characters that are permissible and if the string has characters which are not permissible, I need to flag that string ... here is a snip ...
>>
>>     flagged = list()
>>     strs = ['HELLO', 'Hi%20There', '123...@#@']
>>     p = re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
>>     for s in strs:
>>         if len(p.findall(s)) > 0:
>>             flagged.append(s)
>>
>>     print flagged
>>
>> my question is ... if I wanted to allow '%20' but not '%', how would my current regex (r"[^a-zA-Z0-9]") be modified?
>
> The essence of the approach is to observe that each element is a sequence of zero or more "character", where character is "either letter/digit or escape". So you would use a pattern like
>
>     "([a-zA-Z0-9]|%[0-9a-f][0-9a-f])+"
>
> regards
> Steve
> --
> Steve Holden        +1 571 484 6266   +1 800 494 3119
> Holden Web LLC              http://www.holdenweb.com/
Re: regex problem ..
Analog Kid wrote:

> Hi All:
> I am new to regular expressions in general, and not just re in python. So, apologies if you find my question stupid :) I need some help with forming a regex. Here is my scenario ...
> I have strings coming in from a list, each of which I want to check against a regular expression and see whether or not it "qualifies". By that I mean I have a certain set of characters that are permissible and if the string has characters which are not permissible, I need to flag that string ... here is a snip ...
>
>     flagged = list()
>     strs = ['HELLO', 'Hi%20There', '123...@#@']
>     p = re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
>     for s in strs:
>         if len(p.findall(s)) > 0:
>             flagged.append(s)
>
>     print flagged
>
> my question is ... if I wanted to allow '%20' but not '%', how would my current regex (r"[^a-zA-Z0-9]") be modified?

You might want to normalize before checking, e.g.

    from urllib import unquote

    p=re.compile("[^a-zA-Z0-9 ]")
    flagged=[]

    for s in strs:
        if p.search(unquote(s)):
            flagged.append(s)

Be careful, however, if you want to show the flagged ones back to the user. It is always best to quote/unquote at the boundaries as appropriate.

Regards
Tino
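
For readers on Python 3, where `unquote` lives in `urllib.parse`, the normalize-then-check idea from this message might look as follows. The sample inputs are invented for the sketch and are not the ones from the thread.

```python
import re
from urllib.parse import unquote  # Python 3 home of Python 2's urllib.unquote

# Anything other than a letter, a digit or a space counts as not permitted.
disallowed = re.compile(r'[^a-zA-Z0-9 ]')

# Made-up sample inputs; '%20' decodes to a space, '%2F' to a slash,
# and a lone '%' is left untouched by unquote.
strs = ['HELLO', 'Hi%20There', '100%', 'a%2Fb']

flagged = [s for s in strs if disallowed.search(unquote(s))]
print(flagged)  # ['100%', 'a%2Fb']
```

Note that this treats every escape that decodes to a permitted character as acceptable, so it is slightly broader than "allow exactly %20".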
regex problem ..
Hi All:
I am new to regular expressions in general, and not just re in python. So, apologies if you find my question stupid :) I need some help with forming a regex. Here is my scenario ...
I have strings coming in from a list, each of which I want to check against a regular expression and see whether or not it "qualifies". By that I mean I have a certain set of characters that are permissible and if the string has characters which are not permissible, I need to flag that string ... here is a snip ...

    flagged = list()
    strs = ['HELLO', 'Hi%20There', '123...@#@']
    p = re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
    for s in strs:
        if len(p.findall(s)) > 0:
            flagged.append(s)

    print flagged

my question is ... if I wanted to allow '%20' but not '%', how would my current regex (r"[^a-zA-Z0-9]") be modified?

TIA,
AK
Re: regex problem ..
Analog Kid wrote:

> Hi All:
> I am new to regular expressions in general, and not just re in python. So, apologies if you find my question stupid :) I need some help with forming a regex. Here is my scenario ...
> I have strings coming in from a list, each of which I want to check against a regular expression and see whether or not it "qualifies". By that I mean I have a certain set of characters that are permissible and if the string has characters which are not permissible, I need to flag that string ... here is a snip ...
>
>     flagged = list()
>     strs = ['HELLO', 'Hi%20There', '123...@#@']
>     p = re.compile(r"[^a-zA-Z0-9]", re.UNICODE)
>     for s in strs:
>         if len(p.findall(s)) > 0:
>             flagged.append(s)
>
>     print flagged
>
> my question is ... if I wanted to allow '%20' but not '%', how would my current regex (r"[^a-zA-Z0-9]") be modified?

The essence of the approach is to observe that each element is a sequence of zero or more "character", where character is "either letter/digit or escape". So you would use a pattern like

    "([a-zA-Z0-9]|%[0-9a-f][0-9a-f])+"

regards
Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/
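
One way to apply the suggested pattern is to require that it cover the whole string and to flag whatever it does not cover. A Python 3 sketch; the use of `fullmatch`, the `%20`-only variant and the sample strings are assumptions added for illustration.

```python
import re

# The pattern from the message above: letters/digits or any two-digit
# lower-case hex escape. fullmatch() requires it to cover the entire string.
any_escape = re.compile(r'(?:[a-zA-Z0-9]|%[0-9a-f][0-9a-f])+')

# Narrower variant (illustrative) that permits only the %20 escape.
space_only = re.compile(r'(?:[a-zA-Z0-9]|%20)+')

strs = ['HELLO', 'Hi%20There', 'bad%', 'a%2fb']   # made-up sample inputs

flagged_any = [s for s in strs if not any_escape.fullmatch(s)]
flagged_space = [s for s in strs if not space_only.fullmatch(s)]

print(flagged_any)    # ['bad%']
print(flagged_space)  # ['bad%', 'a%2fb']
```

This inverts the original approach: instead of hunting for a forbidden character, it describes what a valid string looks like and rejects everything else.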
Re: regex problem with re and fnmatch
Hi John,

John Machin wrote on 11/20/2007 09:40 PM:

> On Nov 21, 8:05 am, Fabian Braennstroem [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> I would like to use re to search for lines in a files with the word "README_x.org", where x is any number. E.g. the structure would look like this:
>> [[file:~/pfm_v99/README_1.org]]
>>
>> I tried to use these kind of matchings:
>> #org_files='.*README\_1.org]]'
>> org_files='.*README\_*.org]]'
>> if re.match(org_files,line):
>
> First tip is to drop the leading '.*' and use search() instead of match(). The second tip is to use raw strings always for your patterns.
>
>> Unfortunately, it matches all entries with "README.org", but not the wanted number!?
>
> \_* matches 0 or more occurrences of _ (the \ is redundant). You need to specify one or more digits -- use \d+ or [0-9]+
>
> The . in .org matches ANY character except a newline. You need to escape it with a \.
>
> >>> pat = r'README_\d+\.org'
> >>> re.search(pat, 'README.org')
> >>> re.search(pat, 'README_.org')
> >>> re.search(pat, 'README_1.org')
> <_sre.SRE_Match object at 0x00B899C0>
> >>> re.search(pat, 'README_.org')
> <_sre.SRE_Match object at 0x00B899F8>
> >>> re.search(pat, 'README_Zorg')
> >>>

Thanks a lot, works really nice!

>> After some splitting and replacing I am able to check, if the above file exists. If it does not, I start to search for it using the 'walk' procedure:
>
> I presume that you mean something like: ".. check if the above file exists in some directory. If it does not, I start to search for it somewhere else ..."
>
>> for root, dirs, files in os.walk("/home/fab/org"):
>>     for name in dirs:
>>         dirs=os.path.join(root, name) + '/'
>
> The above looks rather suspicious ...
>     for thing in container:
>         container = something_else
> What are you trying to do?
>
>>     for name in files:
>>         files=os.path.join(root, name)
>
> and again
>
>>         if fnmatch.fnmatch(str(files), "README*"):
>
> Why str(name) ?
>
>>             print "File Found"
>>             print str(files)
>>             break
>
> fnmatch is not as capable as re; in particular it can't express "one or more digits". To search a directory tree for the first file whose name matches a pattern, you need something like this:
>
>     def find_one(top, pat):
>         for root, dirs, files in os.walk(top):
>             for fname in files:
>                 if re.match(pat + '$', fname):
>                     return os.path.join(root, fname)
>
>> As soon as it finds the file,
>
> "the file" or "a file"???
>
> Ummm ... aren't you trying to locate a file whose EXACT name you found in the first exercise??
>
>     def find_it(top, required):
>         for root, dirs, files in os.walk(top):
>             if required in files:
>                 return os.path.join(root, required)

Great :-) Thanks a lot for your help... it can be so easy :-)

Fabian
regex problem with re and fnmatch
Hi,

I would like to use re to search for lines in a files with the word "README_x.org", where x is any number. E.g. the structure would look like this:

    [[file:~/pfm_v99/README_1.org]]

I tried to use these kind of matchings:

    #org_files='.*README\_1.org]]'
    org_files='.*README\_*.org]]'
    if re.match(org_files,line):

Unfortunately, it matches all entries with "README.org", but not the wanted number!?

After some splitting and replacing I am able to check, if the above file exists. If it does not, I start to search for it using the 'walk' procedure:

    for root, dirs, files in os.walk("/home/fab/org"):
        for name in dirs:
            dirs=os.path.join(root, name) + '/'
        for name in files:
            files=os.path.join(root, name)
            if fnmatch.fnmatch(str(files), "README*"):
                print "File Found"
                print str(files)
                break

As soon as it finds the file, it should stop the searching process; but there is the same matching problem like above.

Does anyone have any suggestions about the regex problem?

Greetings!
Fabian
Re: regex problem with re and fnmatch
On Nov 21, 8:05 am, Fabian Braennstroem [EMAIL PROTECTED] wrote:

> Hi,
>
> I would like to use re to search for lines in a files with the word "README_x.org", where x is any number. E.g. the structure would look like this:
> [[file:~/pfm_v99/README_1.org]]
>
> I tried to use these kind of matchings:
> #org_files='.*README\_1.org]]'
> org_files='.*README\_*.org]]'
> if re.match(org_files,line):

First tip is to drop the leading '.*' and use search() instead of match(). The second tip is to use raw strings always for your patterns.

> Unfortunately, it matches all entries with "README.org", but not the wanted number!?

\_* matches 0 or more occurrences of _ (the \ is redundant). You need to specify one or more digits -- use \d+ or [0-9]+

The . in .org matches ANY character except a newline. You need to escape it with a \.

>>> pat = r'README_\d+\.org'
>>> re.search(pat, 'README.org')
>>> re.search(pat, 'README_.org')
>>> re.search(pat, 'README_1.org')
<_sre.SRE_Match object at 0x00B899C0>
>>> re.search(pat, 'README_.org')
<_sre.SRE_Match object at 0x00B899F8>
>>> re.search(pat, 'README_Zorg')
>>>

> After some splitting and replacing I am able to check, if the above file exists. If it does not, I start to search for it using the 'walk' procedure:

I presume that you mean something like: ".. check if the above file exists in some directory. If it does not, I start to search for it somewhere else ..."

> for root, dirs, files in os.walk("/home/fab/org"):
>     for name in dirs:
>         dirs=os.path.join(root, name) + '/'

The above looks rather suspicious ...

    for thing in container:
        container = something_else

What are you trying to do?

>     for name in files:
>         files=os.path.join(root, name)

and again

>         if fnmatch.fnmatch(str(files), "README*"):

Why str(name) ?

>             print "File Found"
>             print str(files)
>             break

fnmatch is not as capable as re; in particular it can't express "one or more digits". To search a directory tree for the first file whose name matches a pattern, you need something like this:

    def find_one(top, pat):
        for root, dirs, files in os.walk(top):
            for fname in files:
                if re.match(pat + '$', fname):
                    return os.path.join(root, fname)

> As soon as it finds the file,

"the file" or "a file"???

Ummm ... aren't you trying to locate a file whose EXACT name you found in the first exercise??

    def find_it(top, required):
        for root, dirs, files in os.walk(top):
            if required in files:
                return os.path.join(root, required)

> it should stop the searching process; but there is the same matching problem like above.

HTH,
John
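
The advice in this message can be tried end to end in a throwaway directory. This is a Python 3 sketch rather than the poster's code: it uses `fullmatch` in place of `re.match(pat + '$', ...)`, and the directory and file names are invented for the test.

```python
import os
import re
import tempfile

def find_one(top, pat):
    """Return the path of the first file under top whose whole name matches pat, else None."""
    regex = re.compile(pat)
    for root, dirs, files in os.walk(top):
        for fname in files:
            if regex.fullmatch(fname):
                return os.path.join(root, fname)
    return None

# Build a small temporary tree so the sketch does not depend on any real files.
with tempfile.TemporaryDirectory() as top:
    sub = os.path.join(top, 'pfm_v99')
    os.mkdir(sub)
    for name in ('README.org', 'README_.org', 'README_12.org'):
        open(os.path.join(sub, name), 'w').close()

    found = find_one(top, r'README_\d+\.org')
    found_name = os.path.basename(found)
    missing = find_one(top, r'README_\d+\.txt')

print(found_name)  # README_12.org
print(missing)     # None
```

Returning from inside the loop is what stops the walk at the first hit; a bare `break` would only leave the innermost loop.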
newb: Simple regex problem headache
    import re

    s1 ='&nbsp;25000&nbsp;'
    s2 = '&nbsp;5.5910&nbsp;'

    mypat = re.compile('[0-9]*(\.[0-9]*|$)')

    rate= mypat.search(s1)
    print rate.group()

    rate=mypat.search(s2)
    print rate.group()
    rate = mypat.search(s1)
    price = float(rate.group())
    print price

I get an error when it hits the whole number, that is in this format:

    s1 ='&nbsp;25000&nbsp;'

For whole number s2, mypat catching empty string. I want it to give me 25000. I am getting this error:

    price = float(rate.group())
    ValueError: empty string for float()

Anyone knows, how I can get 25000 out of s2 = '&nbsp;5.5910&nbsp;' using regex pattern, mypat = re.compile('[0-9]*(\.[0-9]*|$)'). mypat works fine for real numbers, but doesn't work for whole numbers.

thanks
Re: newb: Simple regex problem headache
On Sep 21, 5:04 pm, crybaby [EMAIL PROTECTED] wrote:

> import re
>
> s1 ='&nbsp;25000&nbsp;'
> s2 = '&nbsp;5.5910&nbsp;'
>
> mypat = re.compile('[0-9]*(\.[0-9]*|$)')
>
> rate= mypat.search(s1)
> print rate.group()
>
> rate=mypat.search(s2)
> print rate.group()
> rate = mypat.search(s1)
> price = float(rate.group())
> print price
>
> I get an error when it hits the whole number, that is in this format:
> s1 ='&nbsp;25000&nbsp;'
> For whole number s2, mypat catching empty string. I want it to give me 25000. I am getting this error:
>
> price = float(rate.group())
> ValueError: empty string for float()
>
> Anyone knows, how I can get 25000 out of s2 = '&nbsp;5.5910&nbsp;' using regex pattern, mypat = re.compile('[0-9]*(\.[0-9]*|$)'). mypat works fine for real numbers, but doesn't work for whole numbers.
>
> thanks

Your pattern matches the empty string a bit too well, if you know what I mean!

Changing the regex slightly to '[0-9]+(\.[0-9]+)?' yields the results you want:

    25000
    5.5910
    25000.0
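
Both patterns can be compared side by side. A Python 3 sketch, assuming the sample strings contain the HTML entity `&nbsp;` around the number:

```python
import re

s1 = '&nbsp;25000&nbsp;'
s2 = '&nbsp;5.5910&nbsp;'

old = re.compile(r'[0-9]*(\.[0-9]*|$)')   # every part is optional except "." or end of string
new = re.compile(r'[0-9]+(\.[0-9]+)?')    # at least one digit, optional fractional part

# For s1 the old pattern cannot match at "25000" (neither "." nor the end
# follows the digits), so the only match it finds is the empty string at the end.
old_s1 = old.search(s1).group()
new_s1 = new.search(s1).group()
new_s2 = new.search(s2).group()
price = float(new_s1)

print(repr(old_s1))    # ''
print(new_s1, new_s2)  # 25000 5.5910
print(price)           # 25000.0
```

The general lesson is that a pattern in which everything is optional will happily match nothing; requiring at least one digit with `+` gives the search something concrete to find.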
Re: newb: Simple regex problem headache
crybaby wrote:
> import re
>
> s1 = '&nbsp;25000&nbsp;'
> s2 = '&nbsp;5.5910&nbsp;'
> mypat = re.compile('[0-9]*(\.[0-9]*|$)')
> rate = mypat.search(s1)
> print rate.group()
> rate = mypat.search(s2)
> print rate.group()
> rate = mypat.search(s1)
> price = float(rate.group())
> print price
>
> I get an error when it hits a whole number, that is, a string in this format:
> s1 = '&nbsp;25000&nbsp;'
> For the whole number in s1, mypat catches an empty string. I want it to give
> me 25000. I am getting this error:
>
> price = float(rate.group())
> ValueError: empty string for float()
>
> Does anyone know how I can get 25000 out of s1 = '&nbsp;25000&nbsp;' using
> the regex pattern mypat = re.compile('[0-9]*(\.[0-9]*|$)')? mypat works fine
> for real numbers, but doesn't work for whole numbers.
>
> Thanks

Try this:

>>> import re
>>> s1 = '&nbsp;25000&nbsp;'
>>> s2 = '&nbsp;5.5910&nbsp;'
>>> num_pat = re.compile(r'([0-9]+(\.[0-9]+)?)')
>>> num_pat.search(s1).group(1)
'25000'
>>> num_pat.search(s2).group(1)
'5.5910'

Ian
Re: newb: Simple regex problem headache
On Sep 21, 2007, at 4:04 PM, crybaby wrote:
> import re
>
> s1 = '&nbsp;25000&nbsp;'
> s2 = '&nbsp;5.5910&nbsp;'
> mypat = re.compile('[0-9]*(\.[0-9]*|$)')
> rate = mypat.search(s1)
> print rate.group()
> rate = mypat.search(s2)
> print rate.group()
> rate = mypat.search(s1)
> price = float(rate.group())
> print price
>
> I get an error when it hits a whole number, that is, a string in this format:
> s1 = '&nbsp;25000&nbsp;'
> For the whole number in s1, mypat catches an empty string. I want it to give
> me 25000. I am getting this error:
>
> price = float(rate.group())
> ValueError: empty string for float()
>
> Does anyone know how I can get 25000 out of s1 = '&nbsp;25000&nbsp;' using
> the regex pattern mypat = re.compile('[0-9]*(\.[0-9]*|$)')? mypat works fine
> for real numbers, but doesn't work for whole numbers.

I'm not sure what you just said makes a lot of sense, but if all you're looking for is a regex that will match number strings with or without a decimal point, try '\d*\.?\d*'

Erik Jones
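One caveat about '\d*\.?\d*': every element in it is optional, so with search() it can succeed on the empty string at the very first position. The sketch below illustrates this; the alternative pattern and the variable names are assumptions added for comparison, not something proposed in the message above.

```python
import re

# Strings from the thread; the leading '&' is an assumption about the original.
s1 = '&nbsp;25000&nbsp;'
s2 = '&nbsp;5.5910&nbsp;'

all_optional = re.compile(r'\d*\.?\d*')      # pattern suggested above
digit_first = re.compile(r'\d+(?:\.\d+)?')   # assumed alternative: needs a digit

# search() stops at the first position where the pattern can match, and an
# all-optional pattern can match nothing at index 0.
first_hit = all_optional.search(s1).group()

# findall() still reaches the number, but it is buried among empty matches.
non_empty = [hit for hit in all_optional.findall(s1) if hit]

numbers = [digit_first.search(s).group() for s in (s1, s2)]
```

Requiring at least one digit up front means search() has to skip ahead to the number instead of settling for an empty match.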
regex problem
Hi all,

The line I am trying to match is

    1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011 1

The regex I have written is

    re.compile(r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')

I am trying to extract the 0.00011 value from the above line. Why doesn't it match the group(4) item of the match? Any idea what's wrong with it?

Regards,
KM
Re: regex problem
> The line I am trying to match is
>
> 1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011 1
>
> The regex I have written is
>
> re.compile(r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')
>
> I am trying to extract the 0.00011 value from the above line. Why doesn't it
> match the group(4) item of the match? Any idea what's wrong with it?

Well, your .{25}\s{3} portion only gets you to one space short of your 29.9, so your (\d+... fails to match 29.9 because there's an extra space there. My guess (from only one datum, so this could be /way/ off base) would be that you mean \s{4} or possibly \s{3,4}.

It seems like a very overconstrained regexp, but it might be just what you need to isolate the single line (or class of line) amongst the chaff of a thousand others of similar form.

-tkc
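The diagnosis above can be checked with a short sketch. The archive has collapsed the column spacing in the data line, so the spacing used here (four spaces before 29.9, as the reply implies, and arbitrary gaps after it) is an assumption for illustration only. The third pattern, with a greedy final \d+, is likewise an addition and not part of the reply.

```python
import re

# Assumed reconstruction of the data line; the real column widths are unknown.
line = ('1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra'
        '    29.9    0.00011   1')

head = r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}'
tail = r'(\d+?\.\d)\s+?(\d\.\d+?)'

original = re.compile(head + r'\s{3}' + tail)     # as posted
relaxed = re.compile(head + r'\s{3,4}' + tail)    # the change suggested above
# Assumption: make the last quantifier greedy so it takes all the digits.
greedy = re.compile(head + r'\s{3,4}' + r'(\d+?\.\d)\s+?(\d\.\d+)')

original_match = original.search(line)            # None: one space too few
relaxed_groups = relaxed.search(line).groups()
greedy_groups = greedy.search(line).groups()
```

Two details are worth noting under these assumptions. The nested (P|O|Q) counts as group 3, so group 4 is the 29.9 column and the value after it is group 5. And because \d+? is lazy and nothing follows it in the pattern, it stops after a single digit.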
Re: regex problem
Hi Tim,

Oof! That's true! Thanks a lot. Is there any tool to simplify building the regex?

Regards,
KM

On 11/23/06, Tim Chase [EMAIL PROTECTED] wrote:
> > The line I am trying to match is
> >
> > 1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011 1
> >
> > The regex I have written is
> >
> > re.compile(r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')
> >
> > I am trying to extract the 0.00011 value from the above line. Why doesn't it
> > match the group(4) item of the match? Any idea what's wrong with it?
>
> Well, your .{25}\s{3} portion only gets you to one space short of your 29.9,
> so your (\d+... fails to match 29.9 because there's an extra space there. My
> guess (from only one datum, so this could be /way/ off base) would be that
> you mean \s{4} or possibly \s{3,4}.
>
> It seems like a very overconstrained regexp, but it might be just what you
> need to isolate the single line (or class of line) amongst the chaff of a
> thousand others of similar form.
>
> -tkc
Re: regex problem
> The line I am trying to match is
>
> 1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra29.90.00011 1
>
> The regex I have written is
>
> re.compile(r'(\d+?)\|((P|O|Q)\w{5})\|\w{3,6}\_\w{3,5}\s+?.{25}\s{3}(\d+?\.\d)\s+?(\d\.\d+?)')
>
> I am trying to extract the 0.00011 value from the above line. Why doesn't it
> match the group(4) item of the match? Any idea what's wrong with it?

I am not an expert on REs yet, but I suggest you use re.VERBOSE and split your RE into parts, like this:

    example = re.compile(r"""^ \s*   # must start at the beginning + optional whitespaces
        ( [\[\(] )                   # Group 1: opening bracket
        \s*                          # optional whitespaces
        ( [-+]? \d+ )                # Group 2: first number
        \s* , \s*                    # optional space + comma + optional whitespaces
        ( [-+]? \d+ )                # Group 3: second number
        \s*                          # optional whitespaces
        ( [\)\]] )                   # Group 4: closing bracket
        \s* $                        # optional whitespaces + must end at the end
        """, flags=re.VERBOSE)

This way you can debug and maintain it much better.

Bye,
bearophile
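Applied to the pattern from the question, the re.VERBOSE layout might look like the sketch below. Several things here are assumptions rather than part of the thread: the column spacing in the sample line (the archive collapsed it), the \s{3,4} gap suggested in another reply, the greedy final \d+, and the descriptive comments on each group.

```python
import re

# Assumed reconstruction of the data line; the real column widths are unknown.
line = ('1959400|Q2BYK3|Q2BYK3_9GAMM Hypothetical outer membra'
        '    29.9    0.00011   1')

record = re.compile(r"""
    (\d+?) \|                # Group 1: leading numeric id
    ( (P|O|Q) \w{5} ) \|     # Group 2: six-character code; Group 3: its first letter
    \w{3,6} _ \w{3,5}        # name such as Q2BYK3_9GAMM (not captured)
    \s+?                     # whitespace before the description
    .{25}                    # fixed-width description column
    \s{3,4}                  # column gap, relaxed from the posted \s{3}
    (\d+? \. \d)             # Group 4: first number, e.g. 29.9
    \s+?                     # column gap
    (\d \. \d+)              # Group 5: second number, greedy (an assumption)
    """, re.VERBOSE)

found = record.search(line)
first_number = found.group(4)
second_number = found.group(5)
```

With one element per line, the mismatch between a three-space gap in the pattern and a wider gap in the data is much easier to spot than in the one-line form.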
Re: regex problem
On 2005-07-26, Duncan Booth [EMAIL PROTECTED] wrote:
> >>> rx1=re.compile(r"\b\d{4}(?:-\d{4})?,")
> >>> rx1.findall("1234,-,4567,")
> ['1234,', '-,', '4567,']

Thanks all for the good advice. However, this last expression also matches the first four digits when the input is more than four digits. To resolve this problem, I first do a match with this:

    regex=re.compile(r"\A(\b\d{4},|\d{4}-\d{4},)*(\b\d{4}|\d{4}-\d{4})\Z")

If this turns out OK, I do a findall with your expression, and then I get the desired result.

-- 
If you have a fridge, if you have a TV, then you have everything you need to live
-Jokke Valentinerne
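The two-step approach described above (validate the whole string first, then extract the items) can be sketched as follows. This is not the exact expression from the message: the validating pattern here is built from the same item pattern used for extraction and expects every item to end with a comma, and the names ITEM, whole_string, one_item and split_items, along with the sample inputs, are all made up for illustration.

```python
import re

ITEM = r'\d{4}(?:-\d{4})?,'                          # one entry with its trailing comma
whole_string = re.compile(r'(?:' + ITEM + r')*\Z')   # the entire input and nothing else
one_item = re.compile(r'\b' + ITEM)


def split_items(text):
    """Return the list of entries, or None if `text` is not entirely well formed."""
    # Step 1: match() anchors at the start and \Z at the end, so the whole
    # string must consist of valid entries.
    if whole_string.match(text) is None:
        return None
    # Step 2: the simple pattern can now be trusted to find every entry.
    return one_item.findall(text)


# Without the validation step, findall() quietly returns a partial match
# for malformed input such as a five-digit first number.
unvalidated = one_item.findall('12345-6789,')
```

Building both patterns from one item definition keeps the validation and the extraction from drifting apart.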
regex problem
Input is a string of four-digit sequences, possibly separated by a -, for instance like this:

    "1234,-,4567,"

My regular expression is like this:

    rx1=re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")

When running rx1.findall("1234,-,4567,") I only get the last match as the result. Isn't findall supposed to return all the matches?

Thanks in advance.
Re: regex problem
On Tue, 26 Jul 2005 09:57:23 +, Odd-R. wrote:
> Input is a string of four-digit sequences, possibly separated by a -,
> for instance like this:
>
> "1234,-,4567,"
>
> My regular expression is like this:
>
> rx1=re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")

Hi,

try it without \A and \Z:

    import re
    rx1=re.compile(r"(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)")
    print rx1.findall("1234,-,4567,")
    # --> ['1234,', '-,', '4567,']

Thomas

-- 
Thomas Güttler, http://www.thomas-guettler.de/
Re: regex problem
Odd-R. wrote:
> Input is a string of four-digit sequences, possibly separated by a -,
> for instance like this:
>
> "1234,-,4567,"
>
> My regular expression is like this:
>
> rx1=re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")
>
> When running rx1.findall("1234,-,4567,") I only get the last match as
> the result. Isn't findall supposed to return all the matches?

For a start, an expression that starts with \A and ends with \Z will match the whole string (or not match at all). You have only one match.

Secondly, as you have a group in your expression, findall returns what the group matches. Your expression matches zero or more of what your group matches, provided there is nothing else at the start/end of the string. The "zero or more" makes the re engine waltz about a bit; when the music stopped, the group was matching "4567,".

Thirdly, findall should be thought of as merely a wrapper around a loop using the search method -- it finds all non-overlapping matches of a pattern. So the clue to get from this is that you need a really simple pattern, like the following. You *don't* have to write an expression that does the looping.

So here's the mean lean no-flab version -- you don't even need the parentheses (sorry, Thomas).

>>> rx1=re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
>>> rx1.findall("1234,-,4567,")
['1234,', '-,', '4567,']

HTH,
John
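The three points above can be seen side by side in a small sketch. The input string is hypothetical (the example in the thread appears to have lost its middle entry in archiving), and findall_by_hand is only a rough illustration of the "loop around search" idea, not how the re module is actually implemented.

```python
import re

text = '1234,2222-8888,4567,'   # hypothetical input, not the one from the thread

anchored = re.compile(r'\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z')
simple = re.compile(r'\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,')

# One match covering the whole string; findall() reports the group, and a
# repeated group only remembers its last repetition.
anchored_result = anchored.findall(text)

# No anchors and no group: each entry is its own match.
simple_result = simple.findall(text)


def findall_by_hand(pattern, string):
    """Collect non-overlapping matches by calling search() in a loop."""
    results = []
    pos = 0
    while pos <= len(string):
        found = pattern.search(string, pos)
        if found is None:
            break
        results.append(found.group())
        # Continue after the match; an empty match must still move forward.
        pos = found.end() if found.end() > found.start() else found.end() + 1
    return results
```

Passing pos to search() rather than slicing the string keeps \b and \A looking at the real neighbouring characters.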
Re: regex problem
John Machin wrote:
> So here's the mean lean no-flab version -- you don't even need the
> parentheses (sorry, Thomas).
>
> >>> rx1=re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
> >>> rx1.findall("1234,-,4567,")
> ['1234,', '-,', '4567,']

No flab? What about all that repetition of \d? A less flabby version:

>>> rx1=re.compile(r"\b\d{4}(?:-\d{4})?,")
>>> rx1.findall("1234,-,4567,")
['1234,', '-,', '4567,']
Re: regex problem
Duncan Booth wrote:
> John Machin wrote:
> > So here's the mean lean no-flab version -- you don't even need the
> > parentheses (sorry, Thomas).
> >
> > >>> rx1=re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
> > >>> rx1.findall("1234,-,4567,")
> > ['1234,', '-,', '4567,']
>
> No flab? What about all that repetition of \d? A less flabby version:
>
> >>> rx1=re.compile(r"\b\d{4}(?:-\d{4})?,")
> >>> rx1.findall("1234,-,4567,")
> ['1234,', '-,', '4567,']

OK, good idea to factor out the prefix and follow it by optional -1234. However, optimising re engines do common prefix factoring, *and* they rewrite stuff like x{4} as xxxx.

Cheers,
John