On Fri, Jan 15, 2010 at 11:37:30AM -0800, Ken Krugler wrote:
>
> On Jan 15, 2010, at 11:27am, Doug Carter wrote:
>
> >On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
> >>
> >>On Jan 15, 2010, at 11:07am, Doug Carter wrote:
> >>
> >>>Hi all,
> >>>
> >>>This may be off-topic for this list, but I need to start somewhere.
> >>>
> >>>I need a command line utility to do document format conversion, in a
> >>>batch mode environment. The batch process is a combination of steps,
> >>>one of which is the actual format conversion which is currently being
> >>>done by a collection of Linux binary converters like wvWare,
> >>>pdftohtml, etc.
> >>>
> >>>I've put a shell script wrapper around the tika jar:
> >>>
> >>>java -jar tika-app.jar [infile] > [outfile]
> >>>
> >>>This works OK, but as you would imagine, it is much slower compared
> >>>to a Linux binary.
> >>>
> >>>Does anyone know of a way to improve the performance in a setup like
> >>>this? I know it goes against the whole philosophy of Java, but is
> >>>there a way to compile the Tika jar byte code into a native Linux
> >>>binary? I've taken a look at gcj, but it doesn't look like a simple
> >>>re-compile.
> >>>
> >>>Any ideas would be greatly appreciated.
> >>
> >>If you have a set of documents, easiest would be to pass in a
> >>directory to tika-app (extend it a bit) so that one invocation of the
> >>JVM processes many documents.
> >
> >Hi Ken,
> >
> >I've considered something like this (for the exact reason you stated)
> >but I don't have that flexibility with my current setup. Each document
> >needs to go through a series of processing steps, one of which is the
> >format conversion.
>
> In that case, another cheesy solution is to have the Java process
> watch a specific directory. Whenever a new file (with the appropriate
> name format) appears, it gets processed. This Java process then
> continues to run indefinitely as a kind of processing daemon.
>
> You can avoid hand-off problems by using a name pattern, and renaming
> the file when it's really ready for processing.
>
> There are lots of cleaner, more sophisticated systems involving
> notification systems, queues, RESTful services, etc. which might be
> more appropriate, depending on your needs.

Interesting approach. Thanks for the idea.

Doug
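
For reference, here is a minimal sketch of the watched-directory idea quoted above. It assumes the upstream step writes each file under a temporary name and renames it to end in `.ready` only once it is complete. Everything named here is an illustrative assumption and not part of Tika: the `ConversionDaemon` class, the `Converter` interface, the `.ready`, `.out` and `.failed` suffixes, and the 500 ms polling interval. The stand-in converter only copies the text; a real setup would call Tika at that point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class ConversionDaemon {

    /** Hypothetical hook for the real conversion step (e.g. a call into Tika). */
    public interface Converter {
        String convert(Path input) throws Exception;
    }

    // Assumed naming convention: producers rename a finished file to "*.ready".
    static final String READY_SUFFIX = ".ready";

    /** Converts every "*.ready" file currently in inDir; returns the outputs written. */
    static List<Path> processReadyFiles(Path inDir, Path outDir, Converter converter)
            throws IOException {
        List<Path> written = new ArrayList<>();
        try (DirectoryStream<Path> ready =
                 Files.newDirectoryStream(inDir, "*" + READY_SUFFIX)) {
            for (Path input : ready) {
                String name = input.getFileName().toString();
                String base = name.substring(0, name.length() - READY_SUFFIX.length());
                Path tmp = outDir.resolve(base + ".out.tmp");
                Path out = outDir.resolve(base + ".out");
                try {
                    // Write under a temporary name, then rename, so the next
                    // pipeline step never sees a half-written output file.
                    Files.write(tmp, converter.convert(input).getBytes(StandardCharsets.UTF_8));
                    Files.move(tmp, out, StandardCopyOption.REPLACE_EXISTING);
                    Files.delete(input);
                    written.add(out);
                } catch (Exception e) {
                    // Park the input so a bad document is not retried forever.
                    Files.move(input, inDir.resolve(base + ".failed"),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java ConversionDaemon <inDir> <outDir>");
            return;
        }
        Path inDir = Paths.get(args[0]);
        Path outDir = Paths.get(args[1]);
        // Stand-in converter: copies the text unchanged.
        Converter converter = p -> new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        while (true) {                      // one long-lived JVM, polled in a loop
            processReadyFiles(inDir, outDir, converter);
            Thread.sleep(500);
        }
    }
}
```

On the producing side, the hand-off would then be a write followed by a rename, for example `mv report.doc.partial report.doc.ready`. A rename within one filesystem is normally atomic, so the daemon should never pick up a partially written file. The JVM start-up cost is paid once, not once per document, which is the point of the suggestion.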