Re: Apache + Open Office Headless

Craig Sanders via luv-main Tue, 25 Sep 2018 23:58:42 -0700

On Wed, Sep 26, 2018 at 10:25:29AM +1000, Piers wrote:
> I have a VM with 16GB RAM and 8 Cores. Its job is to accept an HTTP request
> using PHP, take the data (documents) and run open office to convert it to
> plain text and return it. Up until now this process has been fine. It seems
> that the server hangs as it just gets overloaded.


What kind of document files are they?

If they're OO or LO to begin with, you may be able to extract what you need
from them without running open office - OO files are zip files containing a
bunch of XML files, with the most interesting/useful one being "content.xml".
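As a rough sketch of what that looks like in python, using only the standard
library's zipfile module (the function name read_content_xml is just
something made up for illustration, and this assumes the input really is an
OpenDocument file rather than, say, an old binary .doc):

```python
import zipfile


def read_content_xml(odf_file):
    """Return the raw bytes of content.xml from an OpenDocument file.

    odf_file may be a path or a binary file-like object; an
    OpenDocument file is an ordinary zip archive.
    """
    with zipfile.ZipFile(odf_file) as archive:
        # archive.namelist() would list the other members
        # (styles.xml, meta.xml, ...) if you need them.
        return archive.read("content.xml")
```

No office process is started at all here, so the cost per request is little
more than unzipping one archive member.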

If you don't need the entire file converted and just want some specific bits
of data from it, you could use xmlstarlet[1] or a similar XML parsing tool
to extract data from the unzipped XML file(s).  Don't try to use regular
expressions for this; REs are not capable of robustly extracting data from
XML.  You can hack something up that works with the particular sample data
files you're testing with, but it will be extremely fragile, and even minor
changes in the structure of the input file can and will break your script.
Use an XML parser.  If you really insist on not using an XML parser, then at
least use xml2[2] to convert the XML to a line-oriented format suitable for
use as input to sed/grep/awk/etc.
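For example, here is a minimal sketch of pulling the paragraph and heading
text out of content.xml with a real XML parser (python's standard
xml.etree.ElementTree).  The function name is hypothetical, and it is
deliberately simplified: it ignores special elements like tabs and line
breaks, and a paragraph nested inside another paragraph (e.g. in a footnote)
would show up twice.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument "text" elements (text:p, text:h, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs_from_content_xml(xml_bytes):
    """Return the text of every text:p and text:h element, in order."""
    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for elem in root.iter():
        if elem.tag in wanted:
            # itertext() also collects text inside child elements
            # such as text:span.
            paragraphs.append("".join(elem.itertext()))
    return paragraphs
```

Because the parser deals with namespaces, attributes and nesting for you,
this keeps working when the document structure changes in ways that would
break a regular expression.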

Or write a perl or python script to do it. Both languages have libraries for
working with open document files, zip files, xml files, etc.  There are also
libs for working with various versions of microsoft's office file formats.
Other languages may also have similarly useful libraries.


Also worth looking at, pandoc[3] is capable of converting between many
different text document formats (incl. odt, docx, various flavours of
markdown, html, rst, plain text and others).
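If you want to call pandoc from a script rather than from the shell, something
like the following sketch should do.  It assumes pandoc is installed and on
the PATH, and that pandoc can work out the input format from the file
extension; the function names and the 60 second timeout are my own choices
for illustration.

```python
import subprocess


def pandoc_plain_text_command(input_path):
    """Build the pandoc command line for converting a file to plain text."""
    # With no output file given, pandoc writes the result to stdout.
    return ["pandoc", "--to", "plain", input_path]


def convert_with_pandoc(input_path, timeout_seconds=60):
    """Run pandoc and return the converted text.

    Raises subprocess.CalledProcessError if pandoc fails and
    subprocess.TimeoutExpired if it runs longer than the timeout.
    """
    result = subprocess.run(
        pandoc_plain_text_command(input_path),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_seconds,
    )
    return result.stdout
```

The timeout means one bad document can't tie up a worker forever.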


If the files are pdf then you could use pdftotext from poppler-utils[4].
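A similar sketch for PDFs, assuming the pdftotext binary from poppler-utils
is installed and on the PATH (the function names and the timeout value are
again just illustrative):

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext command line for dumping a PDF as plain text."""
    # A text-file argument of "-" tells pdftotext to write to stdout.
    return ["pdftotext", pdf_path, "-"]


def pdf_to_text(pdf_path, timeout_seconds=60):
    """Run pdftotext and return the extracted text as bytes.

    The output is returned undecoded; decode it with whatever
    encoding suits your data.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
        timeout=timeout_seconds,
    )
    return result.stdout
```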



[1] http://xmlstar.sourceforge.net/

[2] AFAIK, xml2 doesn't currently have a home page, and hasn't had one for
years. It's packaged for Debian and probably other distros and if you need the
source code, your nearest debian mirror is probably the best place to find it.

https://tracker.debian.org/pkg/xml2

[3] https://pandoc.org/

[4] http://poppler.freedesktop.org/


> I've just changed the cron jobs from the sending servers to space them apart
> and also upped max / min spare servers.
>
> Previously I have tried other approaches like the JAVA headless and running
> OO as services and using HA Proxy. They haven't been successful (could have
> been my implementations).
>
> The script loads its own version of OO each time a connection is made. Is
> there a better way of doing this? Seems like an awfully big VM to fall
> over/hang all the time.
>
> Your help / ideas / rants are appreciated.

Running OO to convert files seems like overkill unless there's no other option
- many file formats have small, standalone tools for converting them to other
formats and/or extracting data from them.  And several more have libraries for
reading and/or writing them in common languages like perl or python.  This is
almost always a better option than using a heavyweight process like OO or LO
to do the conversion.

If converting with OO is the only option, then I'd suggest using Libre Office
instead of Apache Open Office.  While Open Office still gets some development
effort and attention, what little it gets is completely overshadowed by
developer effort on LO.  LO is years ahead of OO - by comparison, OO is
effectively abandonware.

craig

--
craig sanders <c...@taz.net.au>
_______________________________________________
luv-main mailing list
luv-main@luv.asn.au
https://lists.luv.asn.au/cgi-bin/mailman/listinfo/luv-main
