On Wed, Apr 8, 2009 at 6:28 AM, David Cash <[email protected]> wrote:
> Hi, I'm new to python and have decided to develop a web crawler / file
> downloader as my first application. I am at the stage where the script
> requests a page and parses the page for URLs, then prints them out. However,
> I'd like to change my current regex that greps for 'http' to one that will
> grep for the url variable that is used in the connect string.
>
> I was hoping I could use something like p=re.compile((url).*?'"') but this
> is clearly not the right syntax. Apologies for such a newbie question!

We live for newbie questions :-)

I'm not too sure what you want to do. In general, if you have a string in a variable and you want to include that string in a regex, you should build a new string, then compile that. In your case, the way to do what you asked for is

  p = re.compile(re.escape(url) + '.*?"')

re.escape() is needed because characters such as '.' and '?' in a URL are special in a regex.

But I don't think this will do anything useful. It matches the exact URL followed by whatever text comes before the next double quote, so it only finds links that begin with that exact URL. If your URL has a path component - for example http://some.domain.com/index.html - then the regex will not find other URLs in the domain, such as http://some.domain.com/good/stuff/index.html.

You should look at BeautifulSoup; it is an add-on module that parses HTML and makes it easy to extract links. You might also be interested in the urlparse module, which has functions that break up a URL into components.

http://personalpages.tds.net/~kent37/kk/00009.html # Intro to BeautifulSoup
http://docs.python.org/library/urlparse.html

Kent
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor
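
For reference, here is a minimal sketch of building a regex from a string held in a variable, assuming Python 3. The url value and the html sample are made up for illustration and do not come from the original post. It uses re.escape() so that characters like '.' in the URL match literally, and [^"]* so that each match stops before the closing quote of the attribute.

```python
import re

# Hypothetical values for illustration only.
url = "http://some.domain.com/"
html = (
    '<a href="http://some.domain.com/good/stuff/index.html">x</a> '
    '<a href="http://other.example/">y</a>'
)

# re.escape() makes regex-special characters in the URL (such as '.')
# match literally. [^"]* then consumes everything up to, but not
# including, the next double quote.
pattern = re.compile(re.escape(url) + r'[^"]*')

# With no groups in the pattern, findall() returns the full matches.
links = pattern.findall(html)
print(links)
```

Only links that start with the exact value of url are found, which is why a real HTML parser is usually the better tool for a crawler.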

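
As a rough illustration of the link-extraction idea, the sketch below uses only the standard library, assuming Python 3. html.parser stands in for BeautifulSoup here, so this is not BeautifulSoup's API, and in Python 3 the urlparse functions live in urllib.parse. The names LinkCollector and same_host_links and the sample page are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin() turns relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def same_host_links(base_url, html):
    """Return the links in html that point at the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]


# Hypothetical page content for illustration only.
page = (
    '<a href="/good/stuff/index.html">in</a>'
    '<a href="http://some.domain.com/about.html">in</a>'
    '<a href="http://elsewhere.example/">out</a>'
)
result = same_host_links("http://some.domain.com/index.html", page)
print(result)
```

Comparing the netloc component rather than the whole URL is what lets this find other pages in the same domain, including relative links that a regex anchored on the full URL would miss.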