You could try: http://www.aminus.org/rbre/python/cleanhtml.py
YMMV, as the kids say. But I did choose this over BeautifulSoup or
Strip-o-gram to do this particular thing. I don't remember -why- I chose
it, but there you go. Easy enough to test all three :)

Oh, and if you just want a whole page prettily formatted:

    lynx -dump page.html > file.txt

is often the easiest way.

Good luck,
Andrew

On 10/14/05, Marc Buehler <[EMAIL PROTECTED]> wrote:
> hi.
>
> i have a ton of html files from which i want to
> extract the plain english words, and then write
> those words into a single text file.
>
> example:
> <html>
> <head>
> <... all kinds html tags ...>
> <font color=99cccc size=5>
> this is text
> </font>
>
> from the above, i want to extract the string
> 'this is text' and write it out to a text file.
> note that all of the html files have the same
> format, i.e. the text is always surrounded by the same
> html tags.
> also, i am sorting through thousands of
> html files, so whatever i do needs to be
> fast.
>
> any ideas?
>
> marc
>
> ---------------------------------------------------------------------------------------
> The apocalyptic vision of a criminally insane charismatic cult leader
>
> http://www.marcbuehler.net
> ----------------------------------------------------------------------------------------
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
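
For the case Marc describes, where the wanted text always sits inside the same <font> tag, here is a rough sketch using only the standard library's html.parser. It is not what cleanhtml.py does; the names FontTextExtractor, extract_font_text and write_all are made up for illustration, and the assumption that <font> is the only tag of interest comes from the example in the question.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect the text found inside <font> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <font> tags we are currently inside
        self.chunks = []  # non-empty text pieces collected so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this.
        if tag == "font":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_font_text(html):
    """Return the text inside <font> tags, joined with single spaces."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def write_all(paths, out_path):
    """Append the extracted text of every file in paths to one output file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # errors="replace" is an assumption: old pages often have odd encodings.
            with open(path, encoding="utf-8", errors="replace") as f:
                out.write(extract_font_text(f.read()) + "\n")
```

Something like write_all(glob.glob("*.html"), "words.txt") would then cover a directory of pages. Whether this is fast enough for thousands of files is worth timing against the other options; if the markup really is identical everywhere, a plain regular expression may well be quicker.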