Peter West <[email protected]> writes:

> If you have a JVM lying around, you can extract docx text with Apache Tika.

I use LibreOffice for that purpose. It is not the most efficient option, but I
am sure it covers everything, and the converter is updated each time I update
LibreOffice:

/usr/local/bin/soffice --headless --convert-to pdf $dir/message.raw --outdir $dir

Once in a while the LibreOffice process hangs, so I have a cron job that kills
any such process older than 5 minutes.

Note that it converts the document to PDF, so I still have to do PDF
extraction afterward.

Best regards,

Olivier

>
> --
> Peter West
> [email protected]
> “I am the vine; you are the branches.”
>
> On 7 May 2021, at 2:30 pm, John Hardin <[email protected]> wrote:
>
> On Thu, 6 May 2021, Alex wrote:
>
> Hi,
>
> I'm trying to use the latest ExtractText plugin, but the docx2txt
> program the plugin references is no longer available from
> http://docx2txt.sourceforge.net
>
> Do you have any recommendations for an alternative...?
>
> Perhaps one of (from Stack Overflow):
>
> unzip -p some.docx word/document.xml |\
> sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
>
> unzip -p document.docx word/document.xml |\
> sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
>
> unzip -p document.docx word/document.xml |\
> sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
>
> ...though html2text might be better than sed for reliably de-XMLizing the
> document text.
>
> There's also this:
>
> http://abisource.com/downloads/wv/
>
> There's conflicting information on whether Antiword groks .docx; you may
> want to try it and see. It may be available from your distro, otherwise:
>
> http://www.winfield.demon.nl/index.html
>
> It might be worthwhile to use native perl utilities to unzip the file,
> extract the document.xml content and pass it through XML::XPath to
> extract the text, but that would probably involve code changes to
> ExtractText rather than just configuring it to use an external utility.
>
> Caveat: I have never looked at the ExtractText plugin.
>
> --
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> [email protected] pgpk -a [email protected]
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
> Are you a mildly tech-literate politico horrified by the level of
> ignorance demonstrated by lawmakers gearing up to regulate online
> technology they don't even begin to grasp? Cool. Now you have a
> tiny glimpse into a day in the life of a gun owner. -- Sean Davis
> -----------------------------------------------------------------------
> 2 days until the 76th anniversary of VE day
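
For reference, the cleanup of hung LibreOffice processes described in the reply could be sketched roughly as follows. This is not the actual cron job, only a minimal illustration under stated assumptions: the helper name stale_pids is hypothetical, the match on "soffice" in the command name is a guess at how the process shows up, and the commented usage assumes a ps that supports the etimes column (elapsed time in seconds).

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): reads lines of
# "PID ELAPSED_SECONDS COMMAND" on stdin and prints the PIDs of soffice
# processes that have been running longer than the given number of seconds.
stale_pids() {
    awk -v max="$1" '$2 > max && $3 ~ /soffice/ { print $1 }'
}

# Possible use from a script run by cron every few minutes
# (assumption: this ps supports the "etimes" output column):
#
#   for pid in $(ps -e -o pid= -o etimes= -o comm= | stale_pids 300); do
#       kill "$pid"
#   done
```

The filter is kept separate from ps and kill so that it can be checked against canned input before it is allowed to kill anything; 300 seconds corresponds to the 5-minute limit mentioned in the reply.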
