On 01/25/2011 04:52 PM, Juan Jose Del Toro wrote:
> Dear List;
> 
> I am looking for a way to extract parts of a text from word (.doc,.docx)
> files as well as pdf; the idea is to walk through the whole directory tree
> and populate a csv file with an excerpt from each file.
> For PDF I found PyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
> to read doc, docx.
> 
> 
> 
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

A .docx file is a ZIP archive containing a group of XML files. I don't know
if there is a Python module for it, but you could probably whip up your
own. I know 7z on Windows will extract a .docx (probably any unzip tool can
if you point it at the file, not sure). From there you'll need to explore
the structure and how Microsoft decided to use XML. ElementTree would
probably be useful here. Not sure about a .doc file; a simple dd of a .doc
file shows some garbage (probably useful for formatting ;-) as well as
the text. For .doc I did find this recipe:
http://code.activestate.com/recipes/279003-converting-word-documents-to-text/
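
For the .docx part, here is a minimal sketch of the unzip-and-parse idea
using only the standard library (Python 3). It assumes the body text lives in
word/document.xml inside the archive and uses the WordprocessingML namespace
shown below. The names docx_excerpt and write_excerpts are my own, made up
for illustration, and it does not attempt .doc or PDF files.

```python
import csv
import os
import zipfile
import xml.etree.ElementTree as ET

# Assumption: namespace of the main document part in a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_excerpt(path, max_chars=200):
    """Return up to max_chars characters of plain text from a .docx file."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> elements.
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return " ".join(paragraphs)[:max_chars]


def write_excerpts(top_dir, csv_path, max_chars=200):
    """Walk top_dir and write one CSV row (path, excerpt) per .docx file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "excerpt"])
        for dirpath, _dirnames, filenames in os.walk(top_dir):
            for name in sorted(filenames):
                if not name.lower().endswith(".docx"):
                    continue
                full_path = os.path.join(dirpath, name)
                try:
                    excerpt = docx_excerpt(full_path, max_chars)
                except (zipfile.BadZipFile, KeyError):
                    # Not a valid archive, or no word/document.xml: skip it.
                    continue
                writer.writerow([full_path, excerpt])
```

Calling write_excerpts("some_dir", "excerpts.csv") would then give one row
per readable .docx file. Paragraphs are joined with a single space, so
layout (tables, headers, footnotes) is lost; that seemed acceptable for a
short excerpt.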
