sacha rook wrote:
> Hi I wonder if anyone can help with the following
>
> I am trying to read a html page extract only fully qualified hostnames
> from the page and output these hostnames to a file on disk to be used
> later as input to another program.
>
> I have this so far
>
> import urllib2
> f=open("c:/tmp/newfile.txt", "w")
> for line in urllib2.urlopen("http://www.somedomain.uk"):
>     if "href" in line and "http://" in line:
>         print line
>         f.write(line)
> f.close()
> fu=open("c:/tmp/newfile.txt", "r")
>
> for line in fu.readlines():
>     print line
>
> so i have opened a file to write to, got a page of html, printed and
> written those to file that contain href & http:// references.
> closed file opened file read all the lines from file and printed out
>
> Can someone point me in right direction please on the flow of this
> program, the best way to just extract the hostnames and print these to
> file on disk?

I would start with a regular expression that matches the text of a URL. A
suitable pattern matches exactly the text of each URL, which you can then
extract. You can probably even find a ready-made one with a web search.
Regular expressions are extremely powerful, but they have a bit of a
learning curve, so read up on them first: Google "regular expression
tutorial" or search the list archive for a reference.

> As you can see I am newish to this
>
> Thanks in advance for any help given!
>
> s

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
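
As a rough illustration of the regular-expression approach suggested above, here is a minimal sketch. The pattern and the function names (HREF_HOST, extract_hostnames, write_hostnames) are made up for this example and are not from the original post. It assumes that only absolute http:// or https:// links inside href attributes matter, and it deliberately ignores edge cases such as unquoted attributes or user:password@host URLs.

```python
import re

# Hypothetical pattern: captures the host part of absolute http/https links
# found in quoted href attributes. A simplification, not a full URL parser.
HREF_HOST = re.compile(r'href\s*=\s*["\']https?://([^/"\':?#\s]+)', re.IGNORECASE)


def extract_hostnames(html):
    """Return unique hostnames from href links, in order of first appearance."""
    seen = set()
    hosts = []
    for host in HREF_HOST.findall(html):
        host = host.lower()  # hostnames are case-insensitive
        if host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


def write_hostnames(hosts, path):
    """Write one hostname per line to the file at path."""
    with open(path, "w") as f:
        for host in hosts:
            f.write(host + "\n")
```

With the script from the original post, the page text could come from urllib2.urlopen("http://www.somedomain.uk").read() and the result could be passed to write_hostnames(hosts, "c:/tmp/newfile.txt"). Matching against the whole page rather than line by line also catches several links on one line.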