Steve Lyskawa wrote:
> I am not a programmer by trade, but I've been using Python for 10+ years,
> usually for text file conversion and protocol analysis. I'm having a
> problem with Beautiful Soup. I can get it to scrape all the href links off
> a web page, but I am having problems selecting specific URIs from the
> output supplied by Beautiful Soup.
>
> What exactly is it returning to me, and what command would I use to find
> that out? Do I have to take each line it gives me and put it into a list
> before I can, for example, get only certain URIs containing a certain
> string, or use the results to fetch the web page that a URI refers to?
>
> The pseudo code for what I am trying to do:
>
> Get all URIs from the web page that contain the string "env.html"
> Open the web page each one refers to.
> Scrape selected information off of that page.
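To answer the direct question first: findAll() returns a list-like
ResultSet of Tag objects, and each Tag can be indexed like a dictionary of
its attributes, so tag['href'] is a plain string. A minimal sketch of the
filtering step, assuming Beautiful Soup 3 and Python 2 (the method is
spelled find_all() in Beautiful Soup 4), with a placeholder starting URL:

    # Minimal sketch: Beautiful Soup 3, Python 2; URL is a placeholder.
    import urllib2
    from BeautifulSoup import BeautifulSoup

    page_url = "http://example.com/index.html"    # hypothetical page
    soup = BeautifulSoup(urllib2.urlopen(page_url).read())

    # findAll() yields Tag objects; tag['href'] is an ordinary string,
    # so a plain substring test works on it.
    links = [tag['href'] for tag in soup.findAll('a', href=True)
             if 'env.html' in tag['href']]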
That's very easy to do with lxml.html, which offers an iterlinks() method on
elements to iterate over all the links in a document (not only a-href, but
also links in stylesheets, for example). It can parse directly from a URL,
so you don't need to go through urllib and friends, and it can make the
links in a document absolute before iterating over them, so that relative
links will work for what you are doing.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links

Also, you should use the urlparse module to split each URL (in case it
contains query parameters etc.) and check only the path section.

Stefan
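Putting those two suggestions together, a minimal sketch in Python 2 might
look like the following; the starting URL and the final scraping step
(grabbing the page title) are placeholders, not part of the original advice:

    # Minimal sketch: Python 2 (urlparse lives in urllib.parse on Python 3).
    import urlparse
    import lxml.html

    start_url = "http://example.com/index.html"    # hypothetical page

    doc = lxml.html.parse(start_url).getroot()     # parse straight from the URL
    doc.make_links_absolute()                      # resolve relative links

    for element, attribute, link, pos in doc.iterlinks():
        # Check only the path part, ignoring query strings and fragments.
        if 'env.html' in urlparse.urlsplit(link).path:
            target = lxml.html.parse(link).getroot()
            print link, target.findtext('.//title')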