On Wed, Sep 26, 2018 at 10:25:29AM +1000, Piers wrote:
> I have a VM with 16GB RAM and 8 Cores. It's job is to accept a HTTP request
> using PHP, take the data (documents) and run open office to convert it to
> plain text and return it. Up until now this process has been fine. It seems
> that the server hangs as it just gets overloaded.

What kind of document files are they?

If they're OO or LO to begin with, you may be able to extract what you need
from them without running open office - OO files are zip files containing a
bunch of XML files, with the most interesting/useful one being "content.xml".

If you don't need the entire file converted and just want some specific bits
of data from it, you could use xmlstarlet[1] or a similar XML parsing tool to
extract data from the unzipped xml file(s).

Don't try to use regular expressions for this. REs are not capable of robustly
extracting data from XML (you can hack something up that can be made to work
with the particular sample data files you're testing with, but it will be
extremely fragile and even minor changes in the structure of the input file
can and will break your script). Use an XML parser.

If you really insist on not using an XML parser then at least use xml2[2] to
convert the XML to a line-oriented format suitable for use as input to
sed/grep/awk/etc.

Or write a perl or python script to do it. Both languages have libraries for
working with open document files, zip files, xml files, etc. There are also
libs for working with various versions of microsoft's office file formats.
Other languages may also have similarly useful libraries.

Also worth looking at, pandoc[3] is capable of converting between many
different text document formats (incl. odt, docx, various flavours of
markdown, html, rst, plain text and others).

If the files are pdf then you could use pdftotext from poppler-utils[4].

[1] http://xmlstar.sourceforge.net/

[2] AFAIK, xml2 doesn't currently have a home page, and hasn't had one for
    years. It's packaged for Debian and probably other distros and if you
    need the source code, your nearest debian mirror is probably the best
    place to find it.
    https://tracker.debian.org/pkg/xml2

[3] https://pandoc.org/

[4] http://poppler.freedesktop.org/

> I've just changed the cron jobs from the sending servers to space them apart
> and also upped max / min spare servers.
>
> Previously I have tried other approaches like the JAVA headless and running
> OO as services and using HA Proxy. They haven't been successful (could have
> been my implementations).
>
> The script loads its own version of OO each time a connection is made. Is
> there a better way of doing this? Seems like an awfully big VM to falls
> over/hang all the time.
>
> Your help / ideas / rants are appreciated.

Running OO to convert files seems like overkill unless there's no other
option - many file formats have small, standalone tools for converting them to
other formats and/or extracting data from them. And several more have
libraries for reading and/or writing them in common languages like perl or
python. This is almost always a better option than using a heavyweight
process like OO or LO to do the conversion.

If converting with OO is the only option, then I'd suggest using Libre Office
instead of Apache Open Office. While Open Office still gets some development
effort and attention, what little it gets is completely overshadowed by
developer effort on LO. LO is years ahead of OO - by comparison, OO is
effectively abandonware.

craig

-- 
craig sanders <c...@taz.net.au>
_______________________________________________
luv-main mailing list
luv-main@luv.asn.au
https://lists.luv.asn.au/cgi-bin/mailman/listinfo/luv-main
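
For illustration, here is a minimal sketch of the "unzip it and use an XML
parser" approach for OpenDocument text files, in python with only the standard
library. The function name odt_to_text is made up for this example, and the
sketch assumes that the body text lives in text:p and text:h elements of
content.xml. It ignores tabs, explicit line breaks and repeated-space
elements, and text boxes nested inside a paragraph would show up twice, so
treat it as a starting point rather than a finished converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the text elements in an OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_to_text(path):
    """Return the paragraphs and headings of an ODF text file as plain text.

    Hypothetical helper for illustration only.
    """
    # An .odt file is a zip archive; the document body is in content.xml.
    with zipfile.ZipFile(path) as odt:
        root = ET.fromstring(odt.read("content.xml"))

    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    lines = []
    for elem in root.iter():
        if elem.tag in wanted:
            # itertext() also collects text inside nested spans and links.
            lines.append("".join(elem.itertext()))
    return "\n".join(lines)
```

Calling odt_to_text("example.odt") (example.odt being a placeholder file name)
returns the text as one string with a newline between paragraphs. Because
nothing as heavy as OO gets started per request, this kind of script should be
far cheaper to run for each incoming document.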