On 03/14/2013 02:22 PM, Irina I wrote:
Hi all,

I'm new to Python and am trying to pass a config file to my Python script. The 
config file is so simple and has only two URLs.

The code should take that configuration file as input and generate a single 
file in HTML format as output.

The program must retrieve each web page in the list and extract all the <a> tag links 
from each page. It is only necessary to extract the <a> tag links from the landing 
page of the URLs that you have placed in your configuration file.

The program will output an HTML file containing a list of clickable links from 
the source webpages, grouped by webpage. This is what I came up with 
so far; can someone please tell me if it's good?

Thanks in advance.

[CODE]

- - - - - - - - config.txt - - - - - - - -
http://www.blahblah.bla
http://www.etcetc.etc
- - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - linkscraper.py - - - - - - - -
import urllib

def get_seed_links():
...."""return dict with seed links, from the config file, as keys -- {seed_link: None, ...}"""
....with open("config.txt", "r") as f:
........seed_links = f.read().split('\n')

readlines() is much clearer and accomplishes what you want. Of course then you'd have to remove the newline from each line. But generally when you're reading in manually entered data, you want to do a strip() on each line anyway.
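
Something along these lines is what I have in mind. It's only a sketch: the name read_seed_links is made up for illustration, and skipping blank lines is my assumption about what you want. Note that with f.read().split('\n'), a file that ends in a newline leaves an empty string at the end of the list, which you'd then hand to urlopen().

```python
def read_seed_links(path="config.txt"):
    """Return the URLs listed in the file, one per line, ignoring blank lines."""
    with open(path, "r") as f:
        lines = f.readlines()              # each item still ends with "\n"
    # strip() removes the newline plus any stray spaces typed around the URL
    stripped = [line.strip() for line in lines]
    # assumption: an empty line is not a URL, so drop it
    return [url for url in stripped if url]
```

The result is a plain list; you can still build your dict from it if you prefer that structure.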

....return dict([(s_link, None) for s_link in seed_links])

def get_all_links(seed_link):
...."""return list of links from seed_link page"""
....all_links = []
....source_page = urllib.urlopen(seed_link).read()
....start = 0
....while True:
........start = source_page.find("<a", start)
........if start == -1:
............return all_links
........start = source_page.find("href=", start)
........start = source_page.find("=", start) + 1
........end = source_page.find(" ", start)
........link = source_page[start:end]
........all_links.append(link)

def build_output_file(data):
...."""build and save output file from data. data -- {seed_link:[link, ...], ...}"""
....result = ""
....for seed_link in data:
........result += "<h2>%s</h2>\n<break />" % seed_link

Perhaps by 'break' you really meant 'br' ??

........for link in data[seed_link]:
............result += '<a href="%s">%s</a>\n' % (link, link.replace("http://", ""))
........result += "<html /><html />"

You have no DOCTYPE header in your output file. The html tag pair needs to surround the bulk of the file, not be tacked on as empty tags after each group of links.
You have no head and body sections.
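
To illustrate the overall shape, here is a rough sketch that assumes Python 3 and an HTML5-style DOCTYPE, since you haven't said what the output constraints are. The function name build_page, the page title, and the use of a ul list are all my own choices for the example, not requirements.

```python
from html import escape

def build_page(data):
    """Return a complete HTML document as a string.

    data -- {seed_link: [link, ...], ...}
    """
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head><title>Extracted links</title></head>",  # title text is an assumption
        "<body>",
    ]
    for seed_link, links in data.items():
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # escape() keeps characters such as & and " from breaking the markup
            parts.append('<li><a href="%s">%s</a></li>' % (escape(link), escape(link)))
        parts.append("</ul>")
    parts.append("</body>")
    parts.append("</html>")
    return "\n".join(parts) + "\n"
```

The html pair opens once at the top and closes once at the bottom, with everything else inside it. Writing the returned string to result.htm is then the same two lines you already have.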

....with open("result.htm", "w") as f:
........f.write(result)

def main():
....seed_link_data = get_seed_links()
....for seed_link in seed_link_data:
........seed_link_data[seed_link] = get_all_links(seed_link)
....build_output_file(seed_link_data)

if __name__ == "__main__":
....main()

[/CODE]


You never specify which version of Python this is written for, nor what constraints there are on either the input html or output html. Some comments are omitted, since they're version dependent.

Generally, your code is fragile with respect to which real web pages it would actually work on. Few websites try very hard to have valid html, and even much valid html could break your current assumptions. Consider using Beautiful Soup to parse the pages, rather than searching the raw text you get back from urllib or urllib2 with find().
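
As one example of the fragility: your loop ends each link at the next space, so for <a href="x.html">text</a> it keeps going past the closing > and also keeps the quote marks. Letting a real parser find the attributes avoids that. Here is a minimal sketch, assuming Python 3, which uses the standard library's html.parser rather than Beautiful Soup so that nothing extra needs installing. The names LinkCollector and extract_links are made up for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lowercased
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)

def extract_links(page_text):
    """Return a list of href values found in <a> tags in page_text."""
    collector = LinkCollector()
    collector.feed(page_text)
    collector.close()
    return collector.links
```

You would call extract_links() on the text of each fetched page in place of the find() loop. The values come back without the surrounding quotes, and <a> tags that have no href are simply skipped.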

Your source code would have to be carefully edited to change all those leading periods into spaces before it could even compile in Python. That stops any of us from actually trying it, or pieces of it. So we can only comment by inspection.




--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
