Peter West <[email protected]> writes:

> If you have a JVM lying around, you can extract docx text with Apache Tika.

I use LibreOffice for that purpose. It is not the most efficient option,
but I am confident it covers every case, and the converter stays current
whenever I update LibreOffice:

/usr/local/bin/soffice --headless --convert-to pdf "$dir/message.raw" \
    --outdir "$dir"

Once in a while the LibreOffice process hangs, so I have a cron job that
kills any such process older than 5 minutes.
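
As an alternative to the cron cleanup, the conversion can be run under a
timeout so that a hung process is killed as soon as the limit expires. A
minimal Python sketch, assuming a POSIX system; the helper names
soffice_command and run_with_timeout are made up for illustration, and
the 300-second default simply mirrors the 5 minutes mentioned above:

```python
import os
import signal
import subprocess


def soffice_command(src, outdir, soffice="/usr/local/bin/soffice"):
    """Build the same headless conversion command as shown above."""
    return [soffice, "--headless", "--convert-to", "pdf", src, "--outdir", outdir]


def run_with_timeout(cmd, timeout=300):
    """Run cmd; return True on exit status 0, False on failure or timeout.

    The child gets its own session (and process group), so helper
    processes it starts can be killed together with it on timeout.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Assumption: killing the whole process group is acceptable here.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed child
        return False
```

Something like run_with_timeout(soffice_command(path, outdir)) would then
replace the bare command. Killing the process group rather than the single
PID is a guess at what is needed, since soffice may start a separate worker
process.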

Note that it converts the document to PDF, so I still have to do PDF
extraction afterward.

Best regards,

Olivier

>
> —
> Peter West
> [email protected]
> “I am the vine; you are the branches.”
>
>  On 7 May 2021, at 2:30 pm, John Hardin <[email protected]> wrote:
>
>  On Thu, 6 May 2021, Alex wrote:
>
>  Hi,
>
>  I'm trying to use the latest ExtractText plugin, but the docx2txt
>  program the plugin references is no longer available from
>  http://docx2txt.sourceforge.net
>
>  Do you have any recommendations for an alternative...?
>
>  Perhaps one of (from Stack Overflow):
>
>  unzip -p some.docx word/document.xml |\
>  sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
>
>  unzip -p document.docx word/document.xml |\
>  sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
>
>  unzip -p document.docx word/document.xml |\
>  sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
>
>  ...though html2text might be better than sed for reliably de-XMLizing the
>  document text.
>
>  There's also this:
>
>  http://abisource.com/downloads/wv/
>
>  There's conflicting information on whether Antiword groks .docx; you may
>  want to try it and see. It may be available from your distro, otherwise:
>
>  http://www.winfield.demon.nl/index.html
>
>  It might be worthwhile to use native perl utilities to unzip the file,
>  extract the document.xml content and pass it through XML::XPath to
>  extract the text, but that would probably involve code changes to
>  ExtractText rather than just configuring it to use an external utility.
>
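
For illustration, the unzip-and-parse idea above can be sketched without
external tools. This uses Python's standard library rather than Perl and
XML::XPath, so it is a swapped-in technique, not what ExtractText does.
The function name docx_text is hypothetical. It reads only
word/document.xml, so headers, footers and footnotes (stored in other
parts of the archive) are ignored:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p and w:t elements.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        lines.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(lines)
```

Like the second sed variant, this emits one line per w:p element. For
untrusted mail attachments a hardened XML parser and a size limit on the
unzipped data would probably be wise; neither is shown here.
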
>  Caveat: I have never looked at the ExtractText plugin.
>
>  -- 
>  John Hardin KA7OHZ http://www.impsec.org/~jhardin/
>  [email protected] pgpk -a [email protected]
>  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
>  -----------------------------------------------------------------------
>  Are you a mildly tech-literate politico horrified by the level of
>  ignorance demonstrated by lawmakers gearing up to regulate online
>  technology they don't even begin to grasp? Cool. Now you have a
>  tiny glimpse into a day in the life of a gun owner. -- Sean Davis
>  -----------------------------------------------------------------------
>  2 days until the 76th anniversary of VE day
>

