Hi all,

I'm new to Python and am trying to pass a config file to my Python script. The 
config file is very simple and contains only two URLs.

The code should take that configuration file as input and generate a single 
HTML file as output.
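
As a rough sketch of the "config file as input" part: the path could be taken 
from the command line instead of being hard-coded. The names read_seed_links 
and config_path_from_argv are made up for illustration, and falling back to 
"config.txt" is only an assumption.

```python
def read_seed_links(path):
    """Return the non-empty lines of the file at `path`, with whitespace stripped."""
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]


def config_path_from_argv(argv, default="config.txt"):
    """Return the first command-line argument if given, else `default`.

    `default` is an assumed fallback, not something the task requires.
    """
    return argv[1] if len(argv) > 1 else default


# Possible use inside the script (hypothetical):
#     import sys
#     seed_links = read_seed_links(config_path_from_argv(sys.argv))
```

The script could then be started as "python linkscraper.py config.txt".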

The program must retrieve each web page in the list and extract all the <a> tag 
links from each page. It is only necessary to extract the <a> tag links from 
the landing page of the URLs that you have placed in your configuration file.
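
For the extraction step, an alternative to searching the page text by hand 
might be the standard library's HTML parser. This is only a sketch and assumes 
Python 3 (where the module is html.parser); LinkCollector and extract_links 
are illustrative names, not part of my script.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in page order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

Anchors without an href (for example <a name="top">) and other tags that start 
with "a" (such as <abbr>) are skipped, and relative links are returned as 
written in the page.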

The program will output an HTML file containing a list of clickable links from 
the source webpages, grouped by webpage. This is what I have come up with so 
far. Can someone please tell me if it's good?

Thanks in advance.

[CODE]

- - - - - - - - config.txt - - - - - - - -
http://www.blahblah.bla
http://www.etcetc.etc
- - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - linkscraper.py - - - - - - - -
import urllib

def get_seed_links():
...."""return dict with seed links, from the config file, as keys --
....{seed_link: None, ... }"""
....with open("config.txt", "r") as f:
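........# skip blank lines (e.g. the empty string after a trailing newline)
........seed_links = [line.strip() for line in f if line.strip()]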
....return dict([(s_link, None) for s_link in seed_links])

def get_all_links(seed_link):
...."""return list of links from seed_link page"""
....all_links = []
....source_page = urllib.urlopen(seed_link).read()
....start = 0
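....while True:
........# "<a " (with the space) avoids matching tags such as <abbr> or <area>
........start = source_page.find("<a ", start)
........if start == -1:
............return all_links
........tag_end = source_page.find(">", start)
........if tag_end == -1:
............return all_links
........tag = source_page[start:tag_end]
........start = tag_end  # always move forward, so the loop cannot restart at 0
........href = tag.find("href=")
........quote = tag[href + 5:href + 6]
........if href == -1 or quote not in ('"', "'"):
............continue  # no href, or an unquoted value: skipped by this simple scan
........end = tag.find(quote, href + 6)
........if end != -1:
............all_links.append(tag[href + 6:end])  # value without the quotes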

def build_output_file(data):
...."""build and save output file from data.
....data -- {seed_link: [link, ...], ...}"""
....result = ""
....for seed_link in data:
........result += "<h2>%s</h2>\n<br />" % seed_link
........for link in data[seed_link]:
............result += '<a href="%s">%s</a>\n' % (
................link, link.replace("http://", ""))
........result += "<br /><br />"
....with open("result.htm", "w") as f:
........f.write(result)

def main():
....seed_link_data = get_seed_links()
....for seed_link in seed_link_data:
........seed_link_data[seed_link] = get_all_links(seed_link)
....build_output_file(seed_link_data)

if __name__ == "__main__":
....main()

[/CODE]
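
For the output step, one possible variation is to build a complete HTML 
document and escape the link text before writing it. The following is a sketch 
only, assuming Python 3 (html.escape) and the same {seed_link: [link, ...]} 
data as above; render_links_page is a hypothetical name, and using a <ul> list 
per page is my own assumption about the layout.

```python
from html import escape


def render_links_page(data):
    """Return an HTML document string built from {seed_link: [link, ...]}."""
    parts = ["<html>", "<body>"]
    for seed_link, links in data.items():
        # One heading per source page, followed by its links as a list.
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # escape() also escapes quotes by default, so the value is safe
            # inside href="..." as well as in the visible link text.
            safe = escape(link)
            parts.append('<li><a href="%s">%s</a></li>' % (safe, safe))
        parts.append("</ul>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)
```

The returned string could then be written to result.htm in the same way that 
build_output_file does.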
