You could try: http://www.aminus.org/rbre/python/cleanhtml.py

YMMV, as the kids say, but I did choose this over BeautifulSoup and
Strip-o-gram for this particular job.  I don't remember -why- I
chose it, but there you go.  It's easy enough to test all three :)

Oh, and if you just want a whole page prettily formatted:

lynx -dump page.html > file.txt

is often the easiest way.
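
If you'd rather stay in Python and only want the text inside the <font>
tags from your example, here is a rough sketch using the standard
library's HTMLParser. It is not what cleanhtml.py does; the names
FontTextExtractor and extract_font_text are made up for illustration, and
it assumes the text you want always sits inside a <font> element.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect text that appears inside <font> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <font> tags we are currently inside
        self.chunks = []    # non-empty text fragments found inside <font>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> is matched as well.
        if tag == "font":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_font_text(html):
    """Return the text found inside <font> tags, joined by single spaces."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Sample modelled on the snippet in the original question.
sample = """<html>
<head>
<font color=99cccc size=5>
this is text
</font>
"""
print(extract_font_text(sample))
```

For thousands of files you would loop over the file names, call
extract_font_text on each file's contents, and append the result to one
output file. Whether that is fast enough is something you'd have to
measure on your own data.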

Good luck,

Andrew

On 10/14/05, Marc Buehler <[EMAIL PROTECTED]> wrote:
> hi.
>
> i have a ton of html files from which i want to
> extract the plain english words, and then write
> those words into a single text file.
>
> example:
> <html>
> <head>
> <... all kinds html tags ...>
> <font color=99cccc size=5>
> this is text
> </font>
>
> from the above, i want to extract the string
> 'this is text' and write it out to a text file.
> note that all of the html files have the same
> format, i.e. the text is always surrounded by the same
> html tags.
> also, i am sorting through thousands of
> html files, so whatever i do needs to be
> fast.
>
> any ideas?
>
> marc
>
>
> ---------------------------------------------------------------------------------------
> The apocalyptic vision of a criminally insane charismatic cult leader
>
>    http://www.marcbuehler.net
> ----------------------------------------------------------------------------------------
>
>
>
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
>
