Re: [Tutor] extract plain english words from html

Kent Johnson Fri, 14 Oct 2005 17:55:42 -0700

Marc Buehler wrote:
> hi.
> 
> i have a ton of html files from which i want to
> extract the plain english words, and then write
> those words into a single text file.


If you just want the text from a single tag in each document, then BeautifulSoup
will work well, as Danny and Bob suggest. If you have many tags containing text
and you want all of the text, you might like StripOGram:
http://www.zope.org/Members/chrisw/StripOGram

Or try this succinct example from the Python Cookbook, 2nd edition:
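from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

class XMLJustText(SGMLParser):
    # Called for each run of text between tags; the markup itself is ignored.
    def handle_data(self, data):
        print data

XMLJustText().feed(open('text.xml').read())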
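For anyone on Python 3, where sgmllib no longer exists, the same idea can be sketched with html.parser.HTMLParser from the standard library. This is only a rough sketch, not the Cookbook's code: the names JustText, extract_text and extract_files are made up for illustration, and I am assuming the files are readable as UTF-8. Note that it also picks up the contents of script and style tags, since those reach handle_data as well.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Collect the text between tags and ignore the markup itself."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = JustText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return "\n".join(parser.chunks)


def extract_files(paths, out_path):
    """Write the text of every file in `paths` into one output file."""
    # Assumption: UTF-8 input; undecodable bytes are replaced, not fatal.
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                out.write(extract_text(f.read()) + "\n")
```

Something like extract_files(glob.glob('*.html'), 'words.txt') would then put everything into a single text file. For thousands of files this stays in the standard library and makes one pass over each file, but I have not timed it.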

Kent

> 
> example:
> <html>
> <head>
> <... all kinds html tags ...>
> <font color=99cccc size=5>
> this is text
> </font>
> 
> from the above, i want to extract the string 
> 'this is text' and write it out to a text file.
> note that all of the html files have the same 
> format, i.e. the text is always surrounded by the same
> html tags.
> also, i am sorting through thousands of
> html files, so whatever i do needs to be
> fast.
> 
> any ideas?
> 
> marc
> 
> 
> _______________________________________________
> Tutor maillist  -  [email protected]
> http://mail.python.org/mailman/listinfo/tutor
