In my case, initially at least, the Tika server would be on the same physical server as the application that needs to extract text from the documents uploaded to it, so network traffic is not much of an issue.
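To make the same-machine setup concrete, here is a minimal sketch of how an application might hand an uploaded file to a Tika server that is already running. The address `http://localhost:9998/tika`, the use of an HTTP PUT, and the helper name `extract_text` are assumptions for illustration only; check them against the server version you actually deploy.

```python
import urllib.request

# Assumption: a Tika server is listening on this machine at this address.
# Adjust host, port and path to match your own deployment.
TIKA_URL = "http://localhost:9998/tika"


def extract_text(path, url=TIKA_URL, timeout=60):
    """Send the raw file to the server with HTTP PUT and return the reply as text.

    Hypothetical helper: no JVM is started per document here, the cost is
    one local HTTP round trip to the long-running server process.
    """
    with open(path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The web application would call something like `extract_text("upload.pdf")` for each upload, so the JAR and the Java runtime stay in the server process rather than being loaded on every request.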
The main advantages I can see are:

1. Speed. The server is up and running all the time, so it can process a document immediately. Obviously, with many requests coming in fast, they could get backed up in a queue, but I'm hoping that queue would clear faster.

2. Memory usage. By running the server, the memory usage can be more easily controlled. It would use memory all the time it was running, but that would be in a process completely independent of the web application that needs the documents processed. If the web application needed to run a command-line script every time, with a 25 MB JAR file (before it is decompressed), a Java runtime, and the document being processed all in memory, then I can see all sorts of memory issues getting in the way of its operation.

-- Jason

On 01/07/2012 13:28, Jukka Zitting wrote:
> Hi,
>
> On Sun, Jul 1, 2012 at 2:17 PM, Mark Kerzner <[email protected]> wrote:
>> Out of curiosity, what would be the performance benefit of server vs
>> initialising every time?
> You replace JVM startup overhead with that of a transmitting the
> document over a network connection. How that affects overall system
> performance depends on your deployment details.
>
> BR,
>
> Jukka Zitting
