Oswaldo Martinez wrote:
> OK before I got in to the loop in the script I decided to try first with one
> file and I have some doubts with the some parts in the script,plus I got an
> error:
>
> >>> import re
> >>> file = open("file1.html")
> >>> data = file.read()
> >>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')

This regex does not agree with the data you originally posted. Your
original data was

  <strong>Category:</strong>Category1<br><br>

Do you see the difference? Your regex has a different ending.

> # I searched around the docs on regexes I have and found that the "r" #after
> the re.compile(' will detect repeating words.Why is this useful in #my case?
> I want to read the whole string even if it has repeating words. #Also, I
> dont understand the actual regex (.*?) . If I want to match #everything
> inside </strong> and <br><strong> , shouldn`t I just put a "*"
> # ? I tried that and it gave me an error of course.

As Danny said, the r is not part of the regex; it marks a 'raw' string.
In this case it is not needed, but I always use it for regex strings out
of habit.

The whole string is the regex, not just the (.*?) part. Most of it just
matches against fixed text. The part in parentheses says

  .  match any character (by default, anything except a newline)
  *  match 0 or more of the previous character, i.e. 0 or more of anything
  ?  match non-greedy - match the minimum number of characters needed to
     make the whole match succeed. Without this, the .* could match the
     whole file up to the *last* <br><strong>, which is not what you want!

The parentheses create a group which you can use to pull out the part of
the string that matched inside them. This is the data you want.

> >>> m = catRe.search(data)
> >>> category = m.group(1)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> AttributeError: 'NoneType' object has no attribute 'group'

In this case the match failed, so m is None and m.group(1) gives an error.

> I also found that on some of the strings I want to extract, when python
> reads them using file.read(), there are newline characters and other stuff
> that doesn`t show up in the actual html source.Do I have to take these in to
> account in the regex or will it automatically include them?

This will only be a problem if the newlines are in the text you are
actually trying to match.

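To make the greedy vs. non-greedy difference and the None check concrete, here is a minimal sketch. The sample string is made up to resemble the data you posted (I have not seen your actual file), and first_group() is just a helper name I chose for illustration:

```python
import re

# Hypothetical sample on a single line, modelled on the format quoted
# earlier in this thread; the real file contents are an assumption.
data = ("<strong>Category:</strong>Category1<br><br>"
        "<strong>Title:</strong>Title1<br><br>")

# Same pattern twice: once greedy (.*), once non-greedy (.*?).
greedy = re.compile(r'<strong>Category:</strong>(.*)<br><br>')
lazy = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')

# A pattern whose ending does not occur after the title in this data.
wrong_ending = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')


def first_group(pattern, text):
    """Return group 1 of the first match, or None if nothing matched."""
    m = pattern.search(text)
    if m is None:
        # search() returns None on failure, so check before calling group().
        return None
    return m.group(1)


lazy_result = first_group(lazy, data)        # stops at the first <br><br>
greedy_result = first_group(greedy, data)    # runs on to the last <br><br>
missing = first_group(wrong_ending, data)    # no match at all

print(lazy_result)
print(greedy_result)
print(missing)
```

The non-greedy pattern yields just the category, the greedy one swallows everything up to the last <br><br>, and the pattern with the wrong ending returns None instead of raising AttributeError.
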
Kent