"Shriphani Palakodety" <[EMAIL PROTECTED]> wrote in
> I have a html document here which goes like this:
>
> <A name=4></a><b>Table of Contents</b>
> .........
> <A name=5></a><b>Preface</b>
>
> Can someone tell me how I can get the string between the <b> tag for
> an a tag for a given value of the name attribute.
Here's an example using the standard library HTML parser
(from an unfinished topic in my tutorial). You could also
use BeautifulSoup, and I recommend it if your needs get
any more complex.
----------------------------------------------
In practice we usually want to extract more specific data from a page,
maybe the content of a particular row in a table or similar. For that
we need to use the handle_starttag() and handle_endtag() methods. As
an example let's extract the text of the second H1 level header:
html = '''
<html><head><title>Test page</title></head>
<body>
<center>
<h1>Here is the first heading</h1>
</center>
<p>A short paragraph
<h1>A second heading</h1>
<p>A paragraph containing a
<a href="www.google.com">hyperlink to google</a>
</body></html>
'''
from HTMLParser import HTMLParser  # Python 2; in Python 3: from html.parser import HTMLParser

class H1Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.h1_count = 0        # number of <h1> start tags seen so far
        self.isHeading = False   # True while inside an <h1> element

    def handle_starttag(self, tag, attributes=None):
        if tag == 'h1':
            self.h1_count += 1
            self.isHeading = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.isHeading = False

    def handle_data(self, data):
        # Only report text found inside the second <h1>
        if self.isHeading and self.h1_count == 2:
            print "Second Header contained: ", data

parser = H1Parser()
parser.feed(html)
parser.close()
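
For the question at the top of this message (the <b> text that follows an <a> tag with a given name attribute), the same three handlers can be adapted roughly as below. This is only a sketch under a few assumptions: it is written for Python 3, where the module is html.parser; the class name NamedAnchorParser and the helper bold_text_after_anchor are made up for illustration; and it assumes the <b> you want is the first one after the matching <a>, as in the sample from the question.

```python
from html.parser import HTMLParser  # Python 3 module name (assumption: Python 3)


class NamedAnchorParser(HTMLParser):
    """Collect the text of the first <b> that follows <a name=TARGET>.

    The class and attribute names are illustrative, not a library API.
    """

    def __init__(self, target_name):
        super().__init__()
        self.target_name = target_name
        self.anchor_seen = False  # True once the matching <a> has been seen
        self.in_bold = False      # True while inside the <b> we want
        self.done = False         # True once that <b> has been closed
        self.chunks = []          # handle_data() may deliver text in pieces

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == 'a' and dict(attrs).get('name') == self.target_name:
            self.anchor_seen = True
        elif tag == 'b' and self.anchor_seen and not self.done:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b' and self.in_bold:
            self.in_bold = False
            self.done = True

    def handle_data(self, data):
        if self.in_bold:
            self.chunks.append(data)


def bold_text_after_anchor(html, name):
    """Return the <b> text after <a name=name>, or None if not found."""
    parser = NamedAnchorParser(name)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks) if parser.done else None


# Sample markup taken from the question
sample = '''
<A name=4></a><b>Table of Contents</b>
.........
<A name=5></a><b>Preface</b>
'''

print(bold_text_after_anchor(sample, '5'))  # prints: Preface
```

The flags play the same role as h1_count and isHeading above: the start-tag handler decides when to begin listening, the end-tag handler decides when to stop, and handle_data only stores text while the flag is set.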
----------------------------------------------

Hopefully you can see how to alter that pattern to suit your scenario.

--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor