On 01/25/2011 04:52 PM, Juan Jose Del Toro wrote:
> Dear List;
>
> I am looking for a way to extract parts of a text from word (.doc,.docx)
> files as well as pdf; the idea is to walk through the whole directory tree
> and populate a csv file with an excerpt from each file.
> For PDF I found PyPdf <http://pybrary.net/pyPdf/> but I have found nothing to read
> doc, docx
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

A docx file is a ZIP archive containing a group of XML files. I don't know if there is a Python module for it, but you could probably whip up your own. I know 7z on Windows will extract a .docx (probably any archive tool can if you point it at the file, not sure). From there you'll need to explore the structure and how Microsoft decided to use XML. ElementTree would probably be useful here.

Not sure about a doc file; a simple dd of a doc file shows some garbage (probably useful for formatting ;-) as well as the text. I found this recipe: http://code.activestate.com/recipes/279003-converting-word-documents-to-text/
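
Here is a minimal sketch of the docx idea above, using only the standard library (zipfile, ElementTree, os.walk, csv). It assumes the main body text lives in word/document.xml inside the archive and that the text runs are the w:t elements in the WordprocessingML namespace. The names docx_excerpt, write_excerpts and max_chars are made up for illustration, so treat this as a starting point rather than a finished tool.

```python
import csv
import os
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by the main WordprocessingML elements (w:p, w:t, ...).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_excerpt(path, max_chars=200):
    """Return the first max_chars characters of text found in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return " ".join(paragraphs)[:max_chars]


def write_excerpts(root_dir, csv_path, max_chars=200):
    """Walk root_dir and write one (path, excerpt) row per .docx file."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "excerpt"])
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                if not name.lower().endswith(".docx"):
                    continue
                full_path = os.path.join(dirpath, name)
                try:
                    excerpt = docx_excerpt(full_path, max_chars)
                except (zipfile.BadZipFile, KeyError, ET.ParseError):
                    # Not a valid archive, or no readable word/document.xml.
                    continue
                writer.writerow([full_path, excerpt])
```

Calling write_excerpts("some/folder", "excerpts.csv") would then produce the csv. This does not handle the old binary .doc format, and PDF files would still need a separate reader such as the pyPdf module mentioned in the original message.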