Re: [Tutor] extracting text from word files (.doc, .docx) and pdf

Corey Richardson Tue, 25 Jan 2011 14:46:16 -0800

On 01/25/2011 04:52 PM, Juan Jose Del Toro wrote:
> Dear List;
> 
> I am looking for a way to extract parts of a text from word (.doc,.docx)
> files as well as pdf; the idea is to walk through the whole directory tree
> and populate a csv file with an excerpt from each file.
> For PDF I found PyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
> to read doc, docx
> 
> 
> 
> _______________________________________________
> Tutor maillist  -  [email protected]
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor


A docx file is a ZIP archive containing a group of XML files. I don't know
if there is a Python module for it, but you could probably whip up your
own. I know 7z on Windows will extract a .docx (probably any unzip tool can
if you point it at the file, not sure). From there you'll need to explore
the structure and how Microsoft decided to use XML. ElementTree would
probably be useful here. Not sure about a doc file; a simple dd of a doc
file shows some garbage (probably useful for formatting ;-) as well as
the text. I found this recipe:
http://code.activestate.com/recipes/279003-converting-word-documents-to-text/
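
Here is a rough sketch of the unzip-and-parse idea for docx, using only
the standard library. It assumes the main body lives in word/document.xml
inside the archive and that text sits in <w:t> elements grouped under <w:p>
paragraphs, which is how the files I have looked at are laid out. The
function name docx_text is made up for this example, and headers, footers
and footnotes (stored in other parts of the archive) are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration, not part of any library.
    """
    # A .docx is a ZIP archive; the main body is assumed to be this member.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        pieces = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Something like docx_text("report.docx")[:200] would then give an excerpt
to put in the csv. This does nothing for the old binary .doc format.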