Hi all, I'm new to Python and am trying to pass a config file to my Python script. The config file is very simple and contains only two URLs.
The code should take that configuration file as input and generate a single file in HTML format as output. The program must retrieve each web page in the list and extract all the <a> tag links from each page. It is only necessary to extract the <a> tag links from the landing page of each URL in the configuration file. The program will output an HTML file containing a list of clickable links from the source web pages, grouped by web page. This is what I came up with so far; can someone please tell me if it's good? Thanks in advance.

[CODE]
- - - - - - - - config.txt - - - - - - - -
http://www.blahblah.bla
http://www.etcetc.etc
- - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - linkscraper.py - - - - - - - -
import urllib  # Python 2; in Python 3 urlopen lives in urllib.request


def get_seed_links():
    """return dict with seed links, from the config file, as keys
    -- {seed_link: None, ... }"""
    with open("config.txt", "r") as f:
        # skip blank lines so a trailing newline does not add an empty URL
        seed_links = [line.strip() for line in f if line.strip()]
    return dict([(s_link, None) for s_link in seed_links])


def get_all_links(seed_link):
    """return list of links from seed_link page"""
    all_links = []
    source_page = urllib.urlopen(seed_link).read()
    start = 0
    while True:
        # the trailing space keeps tags such as <abbr> from matching
        start = source_page.find("<a ", start)
        if start == -1:
            return all_links
        tag_end = source_page.find(">", start)
        if tag_end == -1:
            return all_links
        # only look for href inside this tag, not further down the page
        href = source_page.find("href=", start, tag_end)
        start = tag_end  # continue the search after this tag
        if href == -1:
            continue  # <a> tag without an href
        # the value ends at the first space or at the end of the tag
        value = source_page[href + len("href="):tag_end].split(" ")[0]
        all_links.append(value.strip("\"'"))


def build_output_file(data):
    """build and save output file from data.
    data -- {seed_link:[link, ...], ...}"""
    result = ""
    for seed_link in data:
        result += "<h2>%s</h2>\n<br />" % seed_link
        for link in data[seed_link]:
            # one link per line in the rendered page
            result += '<a href="%s">%s</a><br />\n' % (link, link.replace("http://", ""))
        result += "<br /><br />"  # blank space between groups
    with open("result.htm", "w") as f:
        f.write(result)


def main():
    seed_link_data = get_seed_links()
    for seed_link in seed_link_data:
        seed_link_data[seed_link] = get_all_links(seed_link)
    build_output_file(seed_link_data)


if __name__ == "__main__":
    main()
[/CODE]
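
For comparison, here is a rough sketch of the same three steps (read the config, pull the <a> links out of a page, build the grouped HTML) using the standard library's HTML parser instead of the manual find() searching in get_all_links. It assumes Python 3 rather than the Python 2 urllib used above, and the names LinkCollector, extract_links, read_seed_links and build_html are made up for this example. It is one possible approach under those assumptions, not the only way to do it.

```python
# Sketch only: assumes Python 3; all names below are hypothetical.
from html import escape
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag names arrive lower-cased; attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in page order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


def read_seed_links(lines):
    """Return the non-blank config lines as a list of URLs."""
    return [line.strip() for line in lines if line.strip()]


def build_html(data):
    """Build the report text from data -- [(seed_link, [link, ...]), ...]."""
    parts = ["<html>", "<body>"]
    for seed_link, links in data:
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # escape() keeps characters such as & and " from breaking the markup
            parts.append('<li><a href="%s">%s</a></li>' % (escape(link), escape(link)))
        parts.append("</ul>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)
```

To use it, the page text for each URL from read_seed_links(open("config.txt")) would be passed to extract_links, and the resulting (seed_link, links) pairs to build_html. In Python 3 the page can be fetched with urllib.request.urlopen(url).read(), which returns bytes that need decoding first; the right encoding depends on the page, so that part is left out here. Using a list of pairs instead of a dict is just a choice to keep the groups in config-file order.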