On 10/11/07, Dick Moores <[EMAIL PROTECTED]> wrote:
>
>  At 02:06 PM 10/10/2007, Ian Witham wrote:
>
>
> On 10/11/07, *Dick Moores* <[EMAIL PROTECTED]> wrote:
>  I think I could learn a lot about the use of Python with the web by
> writing a script that would look at
> < http://starship.python.net/crew/index.html> and find all the links
> to more than just the default shown by this one:
> < http://starship.python.net/crew/beazley/>. I think there should be
> about 20 URLs in the list. But I need a start. So give me one?
>
>
>
> A start? Start with urllib2 in the standard library.
>
> Load the page source at < http://starship.python.net/crew/index.html> and
> have your script create a list of all the URLs you wish to visit.
>
> Loop through that list, opening each URL. If the page source is different
> from the standard "WAITING..." source then you can add that URL to a new
> list of "good" URLs.
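
The loop described above might be sketched like this (untested against the real pages; urllib2 became urllib.request in Python 3, and the is_waiting_page helper is a hypothetical name for the "WAITING..." check):

```python
import urllib.request  # Python 3 counterpart of urllib2


def is_waiting_page(source):
    # The placeholder pages all contain the standard "WAITING..." text.
    return 'WAITING' in source


def find_good_urls(urls):
    good = []
    for url in urls:
        try:
            source = urllib.request.urlopen(url).read().decode('latin-1')
        except IOError:
            continue  # skip pages that fail to load
        if not is_waiting_page(source):
            good.append(url)
    return good
```

Only the pages that load and are not placeholders end up in the "good" list.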
>
>
> How about a hint of how to get those ">jcooley<" things from the source?
> (I'm able to have the script get the source, using urllib2.)
>


I've done a similar thing 'the hard way': I extracted the information I
needed using split.

In your case you really need the content between each <a href=" and the
following " (without the quotes themselves, of course).

So for your script you might use (not tested):
url_list = source.split('<a href="')
url_list = [code_block.split('"')[0] for code_block in url_list[1:]]

A look at the source will show that the first three and the last URL in
the list are not relevant.
Probably not the most efficient way to do it, but it worked for me.
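
To see it in action, here is the split approach run against a made-up snippet of page source (the HTML below is illustrative, not the actual starship index):

```python
# A tiny stand-in for the real page source.
source = ('<p><a href="beazley/">David Beazley</a>'
          '<a href="jcooley/">jcooley</a></p>')

# Everything after each '<a href="' starts with the URL we want;
# splitting again on '"' and keeping the first piece isolates it.
url_list = source.split('<a href="')
url_list = [code_block.split('"')[0] for code_block in url_list[1:]]
print(url_list)  # ['beazley/', 'jcooley/']
```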

Maybe regular expressions would be a better way to do it.
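
A minimal regex version might look like this (again untested against the real page; the pattern assumes double-quoted href attributes, which is what the split approach assumes too):

```python
import re

# Same illustrative stand-in source as above.
source = '<a href="beazley/">David Beazley</a> <a href="jcooley/">jcooley</a>'

# Capture everything between 'href="' and the next '"'.
urls = re.findall(r'<a href="([^"]+)"', source)
print(urls)  # ['beazley/', 'jcooley/']
```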

Ian.
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
