Re: [Tutor] Passing a config file to Python

Dave Angel Thu, 14 Mar 2013 15:53:54 -0700

On 03/14/2013 02:22 PM, Irina I wrote:

Hi all,


I'm new to Python and am trying to pass a config file to my Python script. The 
config file is very simple and has only two URLs.

The code should take that configuration file as input and generate a single 
file in HTML format as output.

The program must retrieve each web page in the list and extract all the <a> tag links 
from each page. It is only necessary to extract the <a> tag links from the landing 
page of the URLs that you have placed in your configuration file.

The program will output an HTML file containing a list of clickable links from 
the source webpages, grouped by webpage. This is what I came up with 
so far; can someone please tell me if it's good?

Thanks in advance.

[CODE]

- - - - - - - - config.txt - - - - - - - -
http://www.blahblah.bla
http://www.etcetc.etc
- - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - linkscraper.py - - - - - - - -
import urllib

def get_seed_links():
...."""return dict with seed links, from the config file, as keys -- {seed_link: None, ... 
}"""
....with open("config.txt", "r") as f:
........seed_links = f.read().split('\n')

readlines() is much clearer and accomplishes what you want. Of course then you'd have to remove the newline from each line. But generally when you're reading in manually entered data, you want to do a strip() on each line anyway.

....return dict([(s_link, None) for s_link in seed_links])

def get_all_links(seed_link):
...."""return list of links from seed_link page"""
....all_links = []
....source_page = urllib.urlopen(seed_link).read()
....start = 0
....while True:
........start = source_page.find("<a", start)
........if start == -1:
............return all_links
........start = source_page.find("href=", start)
........start = source_page.find("=", start) + 1
........end = source_page.find(" ", start)
........link = source_page[start:end]
........all_links.append(link)

def build_output_file(data):
...."""build and save output file from data. data -- {seed_link:[link, ...], 
...}"""
....result = ""
....for seed_link in data:
........result += "<h2>%s</h2>\n<break />" % seed_link


Perhaps by 'break' you really meant 'br'?

........for link in data[seed_link]:
............result += '<a href="%s">%s</a>\n' % (link, link.replace("http://", 
""))
........result += "<html /><html />"

You have no DOCTYPE header in your output file. The html tag pair needs to surround the bulk of the file, not appear as empty tags with no content.

You have no head and body sections.

....with open("result.htm", "w") as f:
........f.write(result)

def main():
....seed_link_data = get_seed_links()
....for seed_link in seed_link_data:
........seed_link_data[seed_link] = get_all_links(seed_link)
....build_output_file(seed_link_data)

if __name__ == "__main__":
....main()

[/CODE]
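
The remark above about strip() can be sketched as follows. This is only an illustration, assuming Python 3; the name read_seed_links is made up here and is not part of the posted script. It also sidesteps a side effect of f.read().split('\n'): when the file ends with a newline, that call leaves an empty string at the end of the list, which would then be treated as a seed link.

```python
# A minimal sketch, assuming Python 3. read_seed_links is a hypothetical
# name; the default path matches the config.txt used in the post.
def read_seed_links(path="config.txt"):
    """Return the non-blank lines of the config file, whitespace removed."""
    with open(path, "r") as f:
        # Iterating over the file yields one line at a time, newline
        # included; strip() removes the newline and any stray spaces.
        return [line.strip() for line in f if line.strip()]
```

Returning a plain list keeps the URLs in the order they appear in the file; the dict of None values can still be built from it if needed.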

You never specify which version of Python this is written for, nor what constraints there are on either the input html or output html. Some comments are omitted, since they're version dependent.
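
With no stated constraints on the output, one possible shape for it is sketched below: a DOCTYPE, one html element wrapping the whole file, and head and body sections inside it. The assumptions are Python 3, an HTML5 DOCTYPE, a placeholder title of "Links", and a ul list per source page; build_page is a made-up name, and none of these choices come from the original assignment.

```python
# A sketch under stated assumptions: Python 3, HTML5 DOCTYPE, and a
# placeholder title. build_page is a hypothetical name.
from html import escape

def build_page(data):
    """Return a complete HTML document. data -- {seed_link: [link, ...]}"""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head><title>Links</title></head>",
        "<body>",
    ]
    for seed_link, links in data.items():
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # escape() also replaces double quotes by default, so the
            # result is safe inside the href="..." attribute.
            safe = escape(link)
            parts.append('<li><a href="%s">%s</a></li>' % (safe, safe))
        parts.append("</ul>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts) + "\n"
```

The returned string can be written to result.htm exactly as the posted build_output_file does.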

Generally, your code is fragile as to which real web pages would actually work. Few websites actually try very hard to have valid html, and even much valid html could break your current assumptions. Consider Beautiful Soup for the parsing instead of searching the page text by hand.
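
Beautiful Soup is a third-party package, so as a stand-in here is a sketch of the same idea using the standard library's html.parser instead (a swapped-in technique, assuming Python 3). LinkCollector and extract_links are made-up names. Unlike the find() calls in the posted code, a parser copes with href being the last attribute in the tag, with either kind of quote, and with <a> tags that have no href at all.

```python
# A sketch using the standard library instead of Beautiful Soup,
# assuming Python 3. LinkCollector and extract_links are hypothetical.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs with the surrounding quotes removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

The text passed to extract_links would be the decoded page source; fetching it is still a job for urllib or a similar module.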

Your source code would have to be carefully edited to change all those leading periods into spaces before it could even compile in Python. That stops any of us from actually trying it, or pieces of it. So we can only comment by inspection.
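
For anyone who does want to try the posted code, the leading periods could be converted mechanically along these lines. This is a rough sketch assuming Python 3, with dots_to_spaces as a made-up name. It replaces every run of periods at the start of a line, including a wrapped docstring line that happens to begin with a literal "...", so the result still needs a look by eye.

```python
# A rough sketch, assuming Python 3. dots_to_spaces is a hypothetical name.
import re

def dots_to_spaces(source):
    """Replace each run of leading periods with the same number of spaces."""
    # re.MULTILINE makes ^ match at the start of every line, so only
    # periods used as indentation are touched, not ones inside a line.
    return re.sub(r"^\.+", lambda m: " " * len(m.group()), source,
                  flags=re.MULTILINE)
```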





--
DaveA