On Jan 15, 2010, at 11:27am, Doug Carter wrote:

On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:

On Jan 15, 2010, at 11:07am, Doug Carter wrote:


Hi all,

This may be off-topic for this list, but I need to start somewhere.

I need a command line utility to do document format conversion, in a
batch mode environment. The batch process is a combination of steps, one
of which is the actual format conversion which is currently being done
by a collection of Linux binary converters like wvWare, pdftohtml, etc.

I've put a shell script wrapper around the tika jar:

java -jar tika-app.jar [infile] > [outfile]

This works OK, but as you would imagine, it is much slower compared to
a Linux binary.

Does anyone know of a way to improve the performance in a setup like
this? I know it goes against the whole philosophy of Java, but is there
a way to compile the Tika jar byte code into a native Linux binary? I've
taken a look at gcj, but it doesn't look like a simple re-compile.

Any ideas would be greatly appreciated.

If you have a set of documents, easiest would be to pass in a
directory to tika-app (extend it a bit) so that one invocation of the
JVM processes many documents.
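
A rough sketch of that idea, so the JVM startup cost is paid once per batch rather than once per document. The class name BatchConvert, the convertAll helper and the ".txt" output naming are hypothetical, and the converter passed in main is only a placeholder for the real Tika call (this is not how tika-app itself is structured):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Function;

public class BatchConvert {

    /**
     * Converts every regular file directly inside inDir and writes the result
     * to outDir as "<original name>.txt". Returns the number of files converted.
     */
    static int convertAll(Path inDir, Path outDir, Function<Path, String> converter)
            throws IOException {
        Files.createDirectories(outDir);
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inDir)) {
            for (Path in : files) {
                if (!Files.isRegularFile(in)) {
                    continue; // skip subdirectories and other non-files
                }
                Path out = outDir.resolve(in.getFileName() + ".txt");
                Files.write(out, converter.apply(in).getBytes(StandardCharsets.UTF_8));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder converter (assumption). A real run would call Tika here,
        // for example through the org.apache.tika.Tika facade, and wrap its
        // checked exceptions; check the API of the Tika version you have.
        Function<Path, String> converter = p -> "converted: " + p.getFileName();
        int n = convertAll(Paths.get(args[0]), Paths.get(args[1]), converter);
        System.out.println(n + " file(s) converted");
    }
}
```

Invoked as "java BatchConvert indir outdir", this handles the whole directory in a single JVM, which is where most of the saving over one "java -jar" call per file would come from.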

Hi Ken,

I've considered something like this (for the exact reason you stated)
but I don't have that flexibility with my current setup. Each document
needs to go through a series of processing steps, one of which is the
format conversion.

In that case, another cheesy solution is to have the Java process watch a specific directory. Whenever a new file (with the appropriate name format) appears, it gets processed. This Java process then continues to run indefinitely as a kind of processing daemon.

You can avoid hand-off problems by using a name pattern, and renaming the file when it's really ready for processing.
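
A minimal sketch of such a watcher, assuming a Java 7 or later JVM (it uses java.nio.file.WatchService). The ".ready" suffix, the WatchDaemon class and the handler are hypothetical names chosen for illustration; the handler stands in for the actual conversion step:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.function.Consumer;

public class WatchDaemon {

    // Hypothetical naming convention: a file is complete once it ends in ".ready".
    static final String READY_SUFFIX = ".ready";

    /** True if the file name matches the "ready for processing" pattern. */
    static boolean isReady(Path file) {
        return file.getFileName().toString().endsWith(READY_SUFFIX);
    }

    /** Blocks, passing each newly appearing ready file in dir to the handler. */
    static void watch(Path dir, Consumer<Path> handler)
            throws IOException, InterruptedException {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // wait for the next batch of events
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a rescan of dir could go here
                    }
                    Path file = dir.resolve((Path) event.context());
                    if (isReady(file)) {
                        handler.accept(file);
                    }
                }
                if (!key.reset()) {
                    break; // the directory is no longer accessible
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder handler (assumption): the format conversion would go here.
        watch(Paths.get(args[0]), f -> System.out.println("would convert " + f));
    }
}
```

The producing step would write to a temporary name such as "report.doc.tmp" and rename it to "report.doc.ready" only once the file is complete, so the watcher never sees a half-written file. Files already sitting in the directory when the process starts are not picked up by this sketch.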

There are lots of cleaner, more sophisticated systems involving notification systems, queues, RESTful services, etc. which might be more appropriate, depending on your needs.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



