Hello,

On 21 Jul 2023 at 23:51:54, Cristian Zamfir <[email protected]> wrote:

> Hi Tim!
>
> Sorry for the lack of details, adding now.
>
> On 21 Jul 2023 at 18:56:02, Tim Allison <[email protected]> wrote:
>
>> Sorry, I'm not sure I understand precisely what's going on.
>>
>> First, are you running tika-server, tika-app, tika-async, or running Tika
>> programmatically?  I'm guessing tika-server because you've containerized
>> it, but I've containerized tika-async...so...? 😃
>>
>
> Tika-server, the official Docker image with a custom config - the config’s
> main change is the -Xmx arg.
>
>
>> If tika-server, are you sending requests in parallel to each container?
>> If in parallel, how many parallel requests are you allowing?
>>
>
> Yes, sending requests in parallel without limiting how many run at once -
> there is horizontal auto-scaling to deal with load, but the number of
> replicas is based on CPU consumption rather than on the queue size. Is
> there a recommended concurrency level? I could use that instead for the
> HPA. More on that below.
>
>
>>  Are you able to share with me (privately) an example specific file that
>> is causing problems?
>>
>
> Unfortunately no - for security reasons I do not have access to the files
> either, and we deliberately do not log them. That would have been the
> first thing I tried too.
>
>
>> >where despite setting a watchdog to limit the heap to 3GB.
>> You're setting your own watchdog?  Or, is this tika-server's watchdog and
>> you've set -Xmx3g in <forkedJvmArgs>?
>>
>
> Using -Xmx3g.
>
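
For reference, the relevant part of the custom tika-config looks roughly
like this - a sketch rather than the exact file, with other settings
omitted:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <server>
        <params>
          <forkedJvmArgs>
            <arg>-Xmx3g</arg>
          </forkedJvmArgs>
        </params>
      </server>
    </properties>
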
>
>
>> >1. The JVM is slow to observe the forked process exceeding its heap and
>> does not terminate it in time
>> Again, your own watchdog? If tika-server's watchdog...possibly?  I
>> haven't seen this behavior, but it doesn't mean that it can't happen.
>>
>
> I guess there are no known SLAs for how quickly Tika’s watchdog kicks in -
> that is what I was asking. I don’t know how it is implemented.
>
>
>> >2. It's not the heap that grows, but there is some stack overflow due to
>> very deep recursion.
>> Possible, but I don't think so... the default -Xss isn't very deep.
>> Perhaps I misunderstand the suggestion?
>>
>
> I think we are on the same page - I was wondering what non-heap sources
> could account for the memory usage.
>
>
>
>> >Finally, are there any file types that are known to use a lot of memory
>> with Tika?
>> A file from any of the major file formats can be, um, crafted to take up
>> a lot of memory. My rule of thumb is to allow 2GB per thread (if running
>> multithreaded) or request if you're allowing concurrent requests of
>> tika-server.  There will still be some files that cause tika to OOM if
>> you're processing millions/billions of files from the wild.
>>
>
> This happens quite often with regular files, not crafted inputs. I am
> guessing there is a particularly memory-hungry file format and several
> files of that type are being handled in parallel.
>
> With 2GB per request and a heap size of 3GB that would mean very few
> concurrent requests, so not great efficiency. In my experience Tika can
> usually process lots of files in parallel with a 3GB heap.
>
> I noticed this message also appears quite often:
>
>   org.apache.tika.utils.XMLReaderUtils Contention waiting for a SAXParser.
>   Consider increasing the XMLReaderUtils.POOL_SIZE
>
> I am guessing this means the number of requests handled in parallel is
> exceeding a certain internal limit.
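
From a quick look, it seems the SAX parser pool can be enlarged
programmatically when embedding Tika - an untested sketch on my side, and I
have not checked whether tika-server exposes an equivalent setting:

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.utils.XMLReaderUtils;

    public class SaxPoolTuning {
        public static void main(String[] args) throws TikaException {
            // 20 is an arbitrary value; size it to the number of documents
            // expected to be parsed concurrently in the same JVM.
            XMLReaderUtils.setPoolSize(20);
        }
    }
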
>
>
> Now that I understand this better, I have some follow-up questions:
>
>    1. Is there concurrency control I can configure to limit the number
>    of incoming requests handled in parallel?
>
>
I looked at the tika-server options and did not see one for concurrency
control. I did, however, find a reply from Nicholas to an older question of
mine, from which I understood that Tika Pipes may be the answer:
https://www.mail-archive.com/[email protected]/msg03535.html

The main question is whether the latest tika-server implementation uses
Pipes by default, or whether another solution is recommended.
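
In the meantime, I am considering throttling on the client side instead.
Something along these lines is what I have in mind - just a rough sketch,
with the class name and the limit of 2 being placeholders sized per the
2GB-per-request rule of thumb:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;
    import java.util.concurrent.Semaphore;

    public class ThrottledTikaClient {
        // Cap on in-flight requests per tika-server instance (placeholder value).
        private static final int MAX_IN_FLIGHT = 2;
        private static final Semaphore PERMITS = new Semaphore(MAX_IN_FLIGHT);
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        public static String extract(Path file, String tikaBaseUrl) throws Exception {
            PERMITS.acquire(); // block the caller instead of piling requests onto tika-server
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(tikaBaseUrl + "/tika"))
                        .header("Accept", "text/plain")
                        .PUT(HttpRequest.BodyPublishers.ofFile(file))
                        .build();
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } finally {
                PERMITS.release();
            }
        }
    }
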

Thanks,
Cristi



>    2. Assuming the answer is “yes” above, will requests be queued when
>    the limit is reached? If they are dropped, is there a busy status reply
>    to the /tika API?
>    3. Is the queue size or the number of concurrently parsed files
>    exposed through an API?
>
>
>
>> To turn it around, are there specific file types that you are noticing
>> are causing OOM?
>>
>
> I will have to look into obtaining analytics on the input, maybe that will
> shed more light.
>
> Thanks!
>
>
>> On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am seeing some cases with Tika 2.2.1 where, despite setting a watchdog
>>> to limit the heap to 3GB, the entire Tika container grows beyond 6GB,
>>> exceeds the resource memory limit, and gets OOM-killed. Here is one example:
>>>
>>> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB, shmem-rss:32kB,
>>> UID:0 pgtables:700kB oom_score_adj:-997
>>>
>>> Only some files seem to be causing this behavior.
>>>
>>> The memory ramps up fairly quickly - in a few tens of seconds it can go
>>> from 1GB to 6GB.
>>>
>>> The next step is to check if this goes away with 2.8.0, but I wonder if
>>> any of the following explanations make any sense:
>>> 1. The JVM is slow to observe the forked process exceeding its heap and
>>> does not terminate it in time
>>> 2. It's not the heap that grows, but there is some stack overflow due to
>>> very deep recursion.
>>>
>>> Finally, are there any file types that are known to use a lot of memory
>>> with Tika?
>>>
>>> Thanks,
>>> Cristi
>>>
>>>
