On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
>
> On Jan 15, 2010, at 11:07am, Doug Carter wrote:
>
> >Hi all,
> >
> >This may be off-topic for this list, but I need to start somewhere.
> >
> >I need a command line utility to do document format conversion, in a
> >batch mode environment. The batch process is a combination of steps,
> >one of which is the actual format conversion which is currently being
> >done by a collection of Linux binary converters like wvWare,
> >pdftohtml, etc.
> >
> >I've put a shell script wrapper around the tika jar:
> >
> >    java -jar tika-app.jar [infile] > [outfile]
> >
> >This works OK, but as you would imagine, it is much slower compared to
> >a Linux binary.
> >
> >Does anyone know of a way to improve the performance in a setup like
> >this? I know it goes against the whole philosophy of Java, but is
> >there a way to compile the Tika jar byte code into a native Linux
> >binary? I've taken a look at gcj, but it doesn't look like a simple
> >re-compile.
> >
> >Any ideas would be greatly appreciated.
>
> If you have a set of documents, easiest would be to pass in a
> directory to tika-app (extend it a bit) so that one invocation of the
> JVM processes many documents.

Hi Ken,

I've considered something like this (for the exact reason you stated),
but I don't have that flexibility with my current setup. Each document
needs to go through a series of processing steps, one of which is the
format conversion.

Thanks for the idea, though.

Doug
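
For reference, the single-JVM approach Ken describes could be sketched
roughly as below: one process walks an input directory and converts
every file, so JVM startup is paid once per batch rather than once per
document. The class name BatchConvert, the Converter hook, and the
".txt" output naming are all hypothetical, and the placeholder
converter only copies text through. In a real setup the hook would call
into Tika (for example via the org.apache.tika.Tika facade), which is
an assumption here and not something checked against the tika-app jar
in use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BatchConvert {

    /** Hypothetical hook: turns one input file into output text. */
    public interface Converter {
        String convert(Path input) throws Exception;
    }

    /**
     * Converts every regular file directly inside inDir and writes the
     * result to outDir as "<original name>.txt". Returns the number of
     * files converted; a failure on one file is logged and skipped.
     */
    public static int convertAll(Path inDir, Path outDir, Converter converter)
            throws IOException {
        Files.createDirectories(outDir);
        int converted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inDir)) {
            for (Path in : files) {
                if (!Files.isRegularFile(in)) {
                    continue; // skip subdirectories and special files
                }
                Path out = outDir.resolve(in.getFileName() + ".txt");
                try {
                    String text = converter.convert(in);
                    Files.write(out, text.getBytes(StandardCharsets.UTF_8));
                    converted++;
                } catch (Exception e) {
                    System.err.println("failed: " + in + ": " + e);
                }
            }
        }
        return converted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java BatchConvert <indir> <outdir>");
            System.exit(2);
        }
        // Placeholder converter (assumption): decodes the file as UTF-8.
        // A Tika-based converter would be plugged in here instead.
        Converter placeholder =
            in -> new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        int n = convertAll(Paths.get(args[0]), Paths.get(args[1]), placeholder);
        System.out.println("converted " + n + " file(s)");
    }
}
```

This only helps when documents can be grouped per step; in a pipeline
that handles each document individually, the per-invocation JVM startup
cost remains.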