Hi Tim!

Sorry for the lack of details, adding now.

On 21 Jul 2023 at 18:56:02, Tim Allison <[email protected]> wrote:

> Sorry, I'm not sure I understand precisely what's going on.
>
> First, are you running tika-server, tika-app, tika-async, or running Tika
> programmatically?  I'm guessing tika-server because you've containerized
> it, but I've containerized tika-async...so...? 😃
>

Tika-server, the official Docker image with a custom config - the config’s
main change is the -Xmx arg.


> If tika-server, are you sending requests in parallel to each container?
> If in parallel, how many parallel requests are you allowing?
>

Yes, I am sending requests in parallel without capping the number of
concurrent requests - there is horizontal auto-scaling to deal with load,
but the number of replicas is based on CPU consumption rather than on queue
size. Is there a recommended concurrency level? I could use that for the
HPA instead. More on that below.


>  Are you able to share with me (privately) an example specific file that
> is causing problems?
>

Unfortunately no, and for security reasons I do not have access to the files
either - we deliberately do not log them. That would have been the first
thing I would have tried too.


> >where despite setting a watchdog to limit the heap to 3GB.
> You're setting your own watchdog?  Or, is this tika-server's watchdog and
> you've set -Xmx3g in <forkedJvmArgs>?
>

Tika-server's watchdog - using -Xmx3g in <forkedJvmArgs>.
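For context, the relevant part of the tika-config.xml looks roughly like
this - a minimal sketch showing only the heap setting; everything else is
left close to the image defaults:

  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
    <server>
      <params>
        <!-- the main change: cap the heap of the forked parser JVM -->
        <forkedJvmArgs>
          <arg>-Xmx3g</arg>
        </forkedJvmArgs>
      </params>
    </server>
  </properties>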



> >1. The JVM is slow to observe the forked process exceeding its heap and
> does not terminate it in time
> Again, your own watchdog? If tika-server's watchdog...possibly?  I haven't
> seen this behavior, but it doesn't mean that it can't happen.
>

I take it there are no known SLAs on how quickly Tika’s watchdog kicks in -
that is what I was asking. I do not know how it is implemented.


> >2. It's not the heap that grows, but there is some stack overflow due to
> very deep recursion.
> Possible, but I don't think so... the default -Xss isn't very deep.
> Perhaps I misunderstand the suggestion?
>

I think we are on the same page - I was wondering what non-heap sources
could account for the memory usage.



> >Finally, are there any file types that are known to use a lot of memory
> with Tika?
> A file from any of the major file formats can be, um, crafted to take up a
> lot of memory. My rule of thumb is to allow 2GB per thread (if running
> multithreaded) or request if you're allowing concurrent requests of
> tika-server.  There will still be some files that cause tika to OOM if
> you're processing millions/billions of files from the wild.
>

This happens quite often with regular files, not crafted inputs. My guess is
that some particularly memory-hungry file format is involved and several
such files end up being handled in parallel.

With 2GB per request and a 3GB heap, that rule of thumb would allow only a
single concurrent request per container, which is not great for efficiency.
In my experience Tika can usually process lots of files in parallel within a
3GB heap.

I also noticed that this message appears quite often:

org.apache.tika.utils.XMLReaderUtils Contention waiting for a SAXParser.
Consider increasing the XMLReaderUtils.POOL_SIZE

I am guessing this means the number of requests handled in parallel is
exceeding the size of an internal SAX parser pool.
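If it is that pool, then when embedding Tika directly I believe the size can
be raised programmatically - a minimal sketch, assuming XMLReaderUtils still
exposes a static setPoolSize() setter in 2.x and that the default is 10:

  import org.apache.tika.utils.XMLReaderUtils;

  public class SaxPoolSizeSketch {
      public static void main(String[] args) throws Exception {
          // Assumption: setPoolSize() resizes the pooled SAXParsers that the
          // "Contention waiting for a SAXParser" warning refers to.
          // Size it to roughly match the expected number of concurrent parses.
          XMLReaderUtils.setPoolSize(20);
      }
  }

That only helps when embedding Tika directly, though - for the dockerized
tika-server I have not found the equivalent knob, hence the questions below.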


Now that I understand this better, I have some follow-up questions:

   1. Is there concurrency control I can configure to limit the number of
   incoming requests handled in parallel?
   2. Assuming the answer to the above is "yes", will requests be queued when
   the limit is reached? If they are dropped instead, is there a busy status
   reply from the /tika API?
   3. Is the queue size or the number of concurrently parsed files exposed
   through an API?



> To turn it around, are there specific file types that you are noticing are
> causing OOM?
>

I will have to look into obtaining analytics on the input; maybe that will
shed more light.

Thanks!


> On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am seeing some cases with Tika 2.2.1 where despite setting a watchdog
>> to limit the heap to 3GB, the entire Tika container exceeds 6GB and that
>> exceeds the resource memory limit, so it gets OOM-ed. Here is one example:
>>
>> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB, shmem-rss:32kB,
>> UID:0 pgtables:700kB oom_score_adj:-997
>>
>> Only some files seem to be causing this behavior.
>>
>> The memory ramps up fairly quickly, in a few tens of seconds it can go
>> from 1GB to 6GB.
>>
>> The next step is to check if this goes away with 2.8.0, but I wonder if
>> any of the following explanations make any sense:
>> 1. The JVM is slow to observe the forked process exceeding its heap and
>> does not terminate it in time
>> 2. It's not the heap that grows, but there is some stack overflow due to
>> very deep recursion.
>>
>> Finally, are there any file types that are known to use a lot of memory
>> with Tika?
>>
>> Thanks,
>> Cristi
>>
>>
