On 10/11/07, Dick Moores <[EMAIL PROTECTED]> wrote:
> At 02:06 PM 10/10/2007, Ian Witham wrote:
> > On 10/11/07, *Dick Moores* <[EMAIL PROTECTED]> wrote:
> > > I think I could learn a lot about the use of Python with the web by
> > > writing a script that would look at
> > > <http://starship.python.net/crew/index.html> and find all the links
> > > to more than just the default shown by this one:
> > > <http://starship.python.net/crew/beazley/>. I think there should be
> > > about 20 URLs in the list. But I need a start. So give me one?
> >
> > A start? Start with urllib2 in the standard library.
> >
> > Load the page source at <http://starship.python.net/crew/index.html> and
> > have your script create a list of all the URLs you wish to visit.
> >
> > Loop through that list, opening each URL. If the page source is different
> > from the standard "WAITING..." source, then you can add that URL to a new
> > list of "good" URLs.
>
> How about a hint of how to get those ">jcooley<" things from the source?
> (I'm able to have the script get the source, using urllib2.)
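The filtering loop described in the quoted message might be sketched like this. This is only a guess at the approach, not tested against the live site; `is_good_page` and `filter_urls` are invented names, the fetch step is stubbed out so the logic can be shown without network access, and the assumption is that placeholder pages contain the literal text "WAITING...":

```python
def is_good_page(source):
    # Assumption: placeholder pages contain the literal text "WAITING..."
    return "WAITING..." not in source

def filter_urls(urls, fetch):
    # fetch(url) should return the page source; with urllib2 that would be
    # something like: urllib2.urlopen(url).read()
    return [url for url in urls if is_good_page(fetch(url))]

# Demonstration with canned page sources instead of real HTTP:
pages = {
    "http://starship.python.net/crew/beazley/": "<html>David Beazley</html>",
    "http://starship.python.net/crew/other/": "<html>WAITING...</html>",
}
good = filter_urls(sorted(pages), pages.get)
```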
I've done a similar thing 'the hard way': I extracted the information I
needed using split. In your case you really need the content between each
(<a href=") and the following (") -- without brackets, of course. So for
your script you might use (not tested):

url_list = source.split('<a href="')
url_list = [code_block.split('"')[0] for code_block in url_list[1:]]

A look at the source will show that the first 3 and the last 1 URLs in the
list are not relevant.

Probably not the most efficient way to do it, but it worked for me. Maybe
regular expressions would be a better way to do it.

Ian.
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor