This is exactly what I was afraid of. I have to extract thousands and
thousands of documents, and calling the java command *three times* for
each of them is highly inefficient. I want to keep Tika in memory in a
single VM rather than start a new VM every time I need to extract
something. That's why running tika-server is almost ideal for me: yes,
I have to decompress the ZIP/TAR first, but I get everything in a single
call, which is much faster.

Any suggestions on how to wrap Tika (the app) so that it extracts
everything in one call and stays in a single VM, so HotSpot can perform
its optimizations? I guess something in between tika-app and tika-server...
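
To make the question concrete, here is a rough sketch of the shape I have
in mind: one long-lived JVM that loops over all the documents and gets
text, metadata and attachments back from a single call per document. The
names Extraction, DocumentExtractor and BatchExtractor are made up for
illustration and are not Tika API; the actual Tika parsing would have to
go inside a DocumentExtractor implementation, which is the part I don't
know how to write. The stand-in in main only reads the file as UTF-8 so
the sketch runs without Tika on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Everything wanted from one document, returned by a single call.
// Hypothetical holder class, not part of Tika.
final class Extraction {
    final String text;
    final Map<String, String> metadata;
    final List<String> attachmentNames;

    Extraction(String text, Map<String, String> metadata, List<String> attachmentNames) {
        this.text = text;
        this.metadata = metadata;
        this.attachmentNames = attachmentNames;
    }
}

// Hypothetical hook: a real implementation would run the Tika parser once
// and collect text, metadata and embedded resources in that one pass.
interface DocumentExtractor {
    Extraction extract(Path document) throws IOException;
}

// Processes many documents inside one JVM, so startup cost is paid once
// and HotSpot can optimize the hot parsing code across documents.
final class BatchExtractor {
    private final DocumentExtractor extractor;

    BatchExtractor(DocumentExtractor extractor) {
        this.extractor = extractor;
    }

    // Returns results in input order; a document that fails is logged and skipped.
    Map<Path, Extraction> extractAll(List<Path> documents) {
        Map<Path, Extraction> results = new LinkedHashMap<>();
        for (Path document : documents) {
            try {
                results.put(document, extractor.extract(document));
            } catch (IOException | RuntimeException e) {
                System.err.println("Skipping " + document + ": " + e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): treats each file as plain UTF-8 text.
        DocumentExtractor standIn = doc -> new Extraction(
                new String(Files.readAllBytes(doc), StandardCharsets.UTF_8),
                Collections.singletonMap("size", String.valueOf(Files.size(doc))),
                Collections.<String>emptyList());

        List<Path> documents = new ArrayList<>();
        for (String arg : args) {
            documents.add(Paths.get(arg));
        }

        Map<Path, Extraction> results = new BatchExtractor(standIn).extractAll(documents);
        for (Map.Entry<Path, Extraction> entry : results.entrySet()) {
            Extraction x = entry.getValue();
            System.out.println(entry.getKey() + ": " + x.text.length()
                    + " chars, metadata " + x.metadata
                    + ", attachments " + x.attachmentNames);
        }
    }
}
```

Run as "java BatchExtractor file1 file2 ..." and all files are handled by
the same VM; swapping the stand-in for a Tika-backed DocumentExtractor
would leave the loop unchanged.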

Thank you.

On Thu, Aug 7, 2014 at 5:32 PM, Nick Burch <[email protected]> wrote:

> On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
>
>> Hmm, I apologize, but I'm afraid this does not work. If you specify :
>>
>> *java -jar tika-app-1.5-SNAPSHOT.jar --text --metadata --extract
>> --extract-dir=out example.doc*
>>
>>
>> ...it will only extract attachments, not everything (text + meta +
>> attachments). Any flags I'm missing?
>>
>
> With the Tika App, you'll need to run it three times, once for text, once
> for metadata, once for embedded resource extraction
>
> If you want to do all 3 in one go, you'll need to write a few lines of Java
>
> Nick
>



-- 
Bratislav Stojanovic, M.Sc.
