Irina I wrote:
> Hi all,
>
> I'm new to Python and am trying to pass a config file to my Python script.
> The config file is very simple and has only two URLs.
>
> The code should take that configuration file as input and generate a
> single file in HTML format as output.
>
> The program must retrieve each web page in the list and extract all the
> <a> tag links from each page. It is only necessary to extract the <a> tag
> links from the landing page of the URLs placed in the configuration file.
>
> The program will output an HTML file containing a list of clickable links
> from the source webpages, grouped by webpage. This is what I came up with
> so far; can someone please tell me if it's good?
>
> Thanks in advance.
I would advise you to use a library like Beautiful Soup to parse the HTML.
You will find lots of badly formed pages, and trying to manually search for
tags is error prone and frustrating. It is nice to see that you are not
using regular expressions, so your solution will be much easier to debug.
(A minimal sketch of this approach is at the end of this message.)

> [CODE]
>
> - - - - - - - - config.txt - - - - - - - -
> http://www.blahblah.bla
> http://www.etcetc.etc
> - - - - - - - - - - - - - - - - - - - - - -
>
> - - - - - - - - linkscraper.py - - - - - - - -
> import urllib
>
> def get_seed_links():
> ...."""return dict with seed links, from the config file, as keys --
> ....{seed_link: None, ...}"""
> ....with open("config.txt", "r") as f:
> ........seed_links = f.read().split('\n')

    seed_links = f.read().splitlines()  # More descriptive, same effect.

(Note that f.readlines() is *not* equivalent here: it keeps the trailing
'\n' on each line, which would break urlopen() later.)

> ....return dict([(s_link, None) for s_link in seed_links])
>
> def get_all_links(seed_link):
> ...."""return list of links from seed_link page"""
> ....all_links = []
> ....source_page = urllib.urlopen(seed_link).read()
> ....start = 0
> ....while True:
> ........start = source_page.find("<a", start)
> ........if start == -1:
> ............return all_links
> ........start = source_page.find("href=", start)
> ........start = source_page.find("=", start) + 1

Why two calls to find() here? (Ignoring that I think you should use a true
HTML parser.) One call plus an offset does the same job:

    start = source_page.find("href=", start) + 6  # 6 == len('href="'),
                                                  # the attribute name plus
                                                  # the quote delimiter

> ........end = source_page.find(" ", start)

What about links with a space in them? I have seen that before. And what if
there is no space between the URL and the end of the tag, as in
<a href="google.com"/>?

> ........link = source_page[start:end]

Does this remove the ending quote?

> ........all_links.append(link)
>
> def build_output_file(data):
> ...."""build and save output file from data.
> ....data -- {seed_link: [link, ...], ...}"""
> ....result = ""
> ....for seed_link in data:
> ........result += "<h2>%s</h2>\n<break />" % seed_link
> ........for link in data[seed_link]:

I think this would be better written using .iteritems() (Python 2) or
.items() (Python 3):

    for seed_link, links in data.iteritems():
        result += "<h2>%s</h2>\n<break />" % seed_link
        for link in links:

> ............result += '<a href="%s">%s</a>\n' % (link, link.replace("http://", ""))
> ........result += "<html /><html />"
> ....with open("result.htm", "w") as f:
> ........f.write(result)

In general, string concatenation in this manner is a bad idea because it is
a quadratic process: each += copies the whole string so far, so it takes a
lot more time and memory as the output grows. The Python idiom is
'<delimiter>'.join(<list of strings>). Make result a list, do a
`result.append(<some string>)` instead of `result += <some string>`, and
when you finally need the string, do `f.write(''.join(result))`. Three
examples to illustrate how the ''.join() idiom works:

    >>> '!#$#'.join(['a', 'B', '4'])
    'a!#$#B!#$#4'
    >>> ''.join(['a', 'B', '4'])
    'aB4'
    >>> '_'.join(['a', 'B', '4'])
    'a_B_4'

(See the second sketch at the end of this message for build_output_file()
rewritten this way.)

> def main():
> ....seed_link_data = get_seed_links()
> ....for seed_link in seed_link_data:
> ........seed_link_data[seed_link] = get_all_links(seed_link)
> ....build_output_file(seed_link_data)
>
> if __name__ == "__main__":
> ....main()
>
> [/CODE]

~Ramit
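P.S. On the Beautiful Soup point: here is a minimal, untested sketch of what
get_all_links() could look like. It assumes Beautiful Soup 4 is installed
(pip install beautifulsoup4); with the older Beautiful Soup 3 the import and
the find_all spelling differ.

[CODE]
import urllib

from bs4 import BeautifulSoup

def get_all_links(seed_link):
    """Return the href values of all <a> tags on seed_link's landing page."""
    source_page = urllib.urlopen(seed_link).read()
    soup = BeautifulSoup(source_page)
    # href=True matches only <a> tags that actually carry an href
    # attribute, so a malformed anchor will not raise a KeyError.
    return [tag["href"] for tag in soup.find_all("a", href=True)]
[/CODE]

The parser handles quoted and unquoted attributes, URLs containing spaces,
and self-closing tags -- all the cases the find() approach trips over.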
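P.P.S. And here is roughly how build_output_file() would look with the
list-and-''.join() idiom, keeping your structure otherwise. I dropped the
stray <break /> and <html /><html /> strings, since those are not real HTML
elements:

[CODE]
def build_output_file(data):
    """build and save output file from data. data -- {seed_link: [link, ...], ...}"""
    result = []
    for seed_link, links in data.items():  # .iteritems() on Python 2
        result.append("<h2>%s</h2>\n" % seed_link)
        for link in links:
            result.append('<a href="%s">%s</a>\n' % (link, link.replace("http://", "")))
    with open("result.htm", "w") as f:
        f.write(''.join(result))  # one linear join instead of many copies
[/CODE]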