This is exactly what I was afraid of. You see, I have to extract thousands and thousands of documents, and calling the java command *three times* for each of them is highly inefficient. I want to keep tika in memory in a single VM rather than instantiating a new VM every time I need to extract something. That's why running tika-server is almost ideal for me: yes, I have to decompress ZIP/TAR archives first, but I get everything in a single call, which works much faster.
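To make the goal concrete, here is a rough sketch of the shape I have in mind: one extractor object created once and reused for every file, so the VM stays warm across the whole batch. It is plain Java with no Tika calls at all. `SingleVmBatch`, `Extraction`, `Extractor`, `PlainTextExtractor` and `extractAll` are names I made up for illustration, and the stand-in extractor just reads each file as UTF-8 text. A real version would presumably put Tika's parser behind the `Extractor` interface, which is the part I don't know how to write.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from Tika.
class SingleVmBatch {

    /** Text, metadata and attachment names gathered for one document. */
    static final class Extraction {
        final String text;
        final Map<String, String> metadata;
        final List<String> attachmentNames;

        Extraction(String text, Map<String, String> metadata, List<String> attachmentNames) {
            this.text = text;
            this.metadata = metadata;
            this.attachmentNames = attachmentNames;
        }
    }

    /** Hypothetical hook; a Tika-backed implementation would sit behind it. */
    interface Extractor {
        Extraction extract(Path file) throws IOException;
    }

    /** Stand-in that treats every file as UTF-8 text, so the sketch runs without Tika. */
    static final class PlainTextExtractor implements Extractor {
        @Override
        public Extraction extract(Path file) throws IOException {
            byte[] bytes = Files.readAllBytes(file);
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", file.getFileName().toString());
            metadata.put("size", Long.toString(bytes.length));
            // A plain text file has no attachments, so the list is empty here.
            return new Extraction(new String(bytes, StandardCharsets.UTF_8),
                    metadata, Collections.<String>emptyList());
        }
    }

    /** Runs one shared extractor over every file, all inside the current VM. */
    static Map<Path, Extraction> extractAll(List<Path> files, Extractor extractor)
            throws IOException {
        Map<Path, Extraction> results = new LinkedHashMap<>();
        for (Path file : files) {
            results.put(file, extractor.extract(file));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        Extractor extractor = new PlainTextExtractor(); // created once, reused for all files
        for (Map.Entry<Path, Extraction> e : extractAll(files, extractor).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().metadata);
        }
    }
}
```

Run as `java SingleVmBatch a.txt b.txt ...`, this starts the VM once for the whole list instead of three times per document.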
Any suggestions how to wrap tika (app) to extract everything in one call and to stay in a single VM so HotSpot can perform optimizations? I guess something in between tika-app and tika-server...

Thank you.

On Thu, Aug 7, 2014 at 5:32 PM, Nick Burch <[email protected]> wrote:

> On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
>
>> Hmm, I apologize, but I'm afraid this does not work. If you specify :
>>
>> *java -jar tika-app-1.5-SNAPSHOT.jar --text --metadata --extract
>> --extract-dir=out example.doc*
>>
>> ...it will only extract attachments, not everything (text + meta +
>> attachments). Any flags I'm missing?
>
> With the Tika App, you'll need to run it three times, once for text, once
> for metadata, once for embedded resource extraction
>
> If you want to do all 3 in one go, you'll need to write a few lines of Java
>
> Nick

--
Bratislav Stojanovic, M.Sc.
