I've done some testing of Tika to determine how performant the JAXRS server is under heavy load, making 4-8 simultaneous requests as fast as the web service would respond, using a variety of test documents. (Some of these document types were supported by Tika, some weren't.) I have a large text extraction job coming up--millions of docs--and I needed to determine what kind of resources I would need. During this testing, I found that CPU usage was highest when Tika was unwinding exceptions. This CPU usage would persist long after my ~10GB of documents had been processed.
These stack traces appeared to pile up: documents would continue to be processed as requests were made, and Tika would opportunistically print a stack trace when it wasn't busy responding to other work. The traces would scroll by--often for several minutes--after I had finished making requests. I didn't dig into the cause, because once I began filtering the document types I was sending it, the number of exceptions thrown dropped sharply and performance got better. As you might expect, this brought CPU (and memory!) usage down dramatically.

With that in mind:

- Have you captured any console output?
- How busy is your web service?
- Are you filtering the document types before they're processed?
- Can you reproduce the problem in a test environment?

-Rian

On Sat, Dec 21, 2013 at 1:02 PM, Nate Findley <[email protected]> wrote:

> I am running Tika Server for processing files via curl requests. The
> servers start running 100 CPU after a day or so. I am wondering if there
> is any information about how to debug this situation. The wiki is pretty
> thin on information.
>
> Regards,
> Nate
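
P.S. For what it's worth, here is a minimal sketch of the kind of client-side filtering mentioned above. It is not the script I actually used. The extension allowlist and the function names are hypothetical, and the URL assumes a Tika server on its default port (9998) accepting PUT requests at /tika, so adjust all of these to your setup.

```python
import os
import urllib.request

# Hypothetical allowlist: replace with the types your Tika build handles well.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx", ".txt", ".html"}


def is_supported(path, allowed=SUPPORTED_EXTENSIONS):
    """Return True if the file's extension (case-insensitive) is in the allowlist."""
    return os.path.splitext(path)[1].lower() in allowed


def filter_supported(paths, allowed=SUPPORTED_EXTENSIONS):
    """Keep only the paths worth sending to the server, preserving order."""
    return [p for p in paths if is_supported(p, allowed)]


def extract_text(path, server_url="http://localhost:9998/tika"):
    """PUT one file to the (assumed) Tika server endpoint and return plain text."""
    with open(path, "rb") as f:
        request = urllib.request.Request(server_url, data=f.read(), method="PUT")
    request.add_header("Accept", "text/plain")
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Running the document list through filter_supported() before calling extract_text() keeps unsupported files from ever reaching the server, so it never has to throw on them. Filtering by extension is crude; checking the detected MIME type would be more reliable, but it costs an extra request per file.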
