On Thu, Apr 9, 2009 at 7:27 PM, Steve Lyskawa <steve.mck...@gmail.com> wrote:
> I'm having a problem with Beautiful Soup. I can get it to scrape off all
> the href links on a web page but I am having problems selecting specific
> URI's from the output supplied by Beautiful Soup.
> What exactly is it returning to me and what command would I use to find
> that out?
Generally it gives you Tag and NavigableString objects, or lists of the
same. To find out what something is, print its type:

    print type(x)

> Do I have to take each line it gives me and put it into a list before I
> can, for example, get only certain URI's containing a certain string or
> use the results to get the web page that the URI is referring to?
> The pseudo code for what I am trying to do:
> Get all URI's from web page that contain string "env.html"
> Open the web page it is referring to.
> Scrape selected information off of that page.

If you want to get all of the URI's at once, then yes, that implies
creating a list. You could also process the URI's one at a time.

> I'm having a problem with step #1. I can get all URI's but I can't seem
> to get re.compile to work right. If I could get it to give me the URI
> only, without tags or link description, that would be ideal.

Something like this should get you started:

    import re
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html)   # html = the page source you have fetched
    for anchor in soup.findAll('a', href=re.compile(r'env\.html')):
        print anchor['href']

That says: find all <a> tags whose 'href' attribute matches the regex
'env\.html'. Each matching Tag object is assigned to the anchor variable
in turn, and the value of its 'href' attribute is printed.

I find it very helpful with BS to experiment at the command line. It
often takes a few tries to understand what it is giving you and how to
get exactly what you want.

Kent
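P.S. Putting your three pseudo-code steps together, a rough sketch might
look like the following. This assumes Python 2 with urllib2 for fetching
the pages, BeautifulSoup 3, and a made-up starting URL; substitute your
own URL and adapt the "scrape selected information" part to what you
actually need:

    import re
    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup

    start_url = 'http://www.example.com/index.html'   # made-up URL

    # Step 1: find all links on the page whose href contains "env.html"
    soup = BeautifulSoup(urllib2.urlopen(start_url).read())
    for anchor in soup.findAll('a', href=re.compile(r'env\.html')):
        # Step 2: open the page the link refers to (urljoin handles
        # relative links)
        page_url = urlparse.urljoin(start_url, anchor['href'])
        page = BeautifulSoup(urllib2.urlopen(page_url).read())
        # Step 3: scrape selected information off that page; printing
        # the <title> is just a stand-in for whatever you actually want
        print page_url, page.title

Real pages usually need some error handling (dead links, timeouts, and
so on), but this should show the overall shape of the loop.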