Hello,

On 21 Jul 2023 at 23:51:54, Cristian Zamfir <[email protected]> wrote:
> Hi Tim!
>
> Sorry for the lack of details, adding now.
>
> On 21 Jul 2023 at 18:56:02, Tim Allison <[email protected]> wrote:
>
>> Sorry, I'm not sure I understand precisely what's going on.
>>
>> First, are you running tika-server, tika-app, tika-async, or running Tika
>> programmatically? I'm guessing tika-server because you've containerized
>> it, but I've containerized tika-async...so...? 😃
>
> Tika-server, the official docker image with a custom config - the config’s
> main change is the -Xmx arg.
>
>> If tika-server, are you sending requests in parallel to each container?
>> If in parallel, how many parallel requests are you allowing?
>
> Yes, sending requests in parallel without managing the number of requests
> in parallel - there is horizontal auto-scaling to deal with load, but the
> number of replicas is based on CPU consumption rather than on queue size.
> Is there a recommended concurrency level? I could use that instead for
> HPA. More on that below.
>
>> Are you able to share with me (privately) an example specific file that
>> is causing problems?
>
> Unfortunately no, and I do not have access to the files either for
> security reasons - we do not log them on purpose. That would have been
> the first thing I would have tried too.
>
>> > where despite setting a watchdog to limit the heap to 3GB.
>> You're setting your own watchdog? Or, is this tika-server's watchdog and
>> you've set -Xmx3g in <forkedJvmArgs>?
>
> Using -Xmx3g.
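For reference, that heap setting is the only substantive change in my
tika-config; roughly like this (from memory, and assuming I am reading the
tika-server 2.x config format correctly, so treat it as a sketch rather
than my exact file):

  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
    <server>
      <params>
        <forkedJvmArgs>
          <!-- cap the forked parser JVM's heap at 3GB -->
          <arg>-Xmx3g</arg>
        </forkedJvmArgs>
      </params>
    </server>
  </properties>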
>> > 1. The JVM is slow to observe the forked process exceeding its heap and
>> > does not terminate it in time
>> Again, your own watchdog? If tika-server's watchdog...possibly? I
>> haven't seen this behavior, but it doesn't mean that it can't happen.
>
> I guess there are no known SLAs for Tika’s watchdog kicking in - that is
> what I was asking. I don’t know how it is implemented.
>
>> > 2. It's not the heap that grows, but there is some stack overflow due to
>> > very deep recursion.
>> Possible, but I don't think so... the default -Xss isn't very deep.
>> Perhaps I misunderstand the suggestion?
>
> I think we are on the same page - I was thinking about what non-heap
> sources could account for the memory usage.
>
>> > Finally, are there any file types that are known to use a lot of memory
>> > with Tika?
>> A file from any of the major file formats can be, um, crafted to take up
>> a lot of memory. My rule of thumb is to allow 2GB per thread (if running
>> multithreaded) or per request if you're allowing concurrent requests of
>> tika-server. There will still be some files that cause tika to OOM if
>> you're processing millions/billions of files from the wild.
>
> This happens quite often with regular files, not crafted inputs. I am
> guessing there is a particularly memory-hungry file format and several of
> them are being handled in parallel.
>
> With 2GB per request and a 3GB heap that would mean very few concurrent
> requests, so not great efficiency. Most of the time, in my experience,
> Tika can process lots of files in parallel with a 3GB heap.
>
> I noticed this message also appears quite often:
>   org.apache.tika.utils.XMLReaderUtils Contention waiting for a SAXParser.
>   Consider increasing the XMLReaderUtils.POOL_SIZE
> I am guessing this means the number of requests handled in parallel is
> exceeding a certain internal limit.
>
> Now that I understand this better, I have some followup questions:
>
> 1. Is there concurrency control I can configure, to limit the number of
> incoming requests handled in parallel?

I looked at the tika-server options and did not see an option for
concurrency control. I did, however, find a reply from Nicholas to an
older question of mine suggesting that Tika Pipes may be the answer:
https://www.mail-archive.com/[email protected]/msg03535.html
The main question is whether the latest tika-server implementation uses
Pipes by default, or whether another solution is recommended.

Thanks,
Cristi

> 2. Assuming the answer is “yes” above, will requests be queued when the
> limit is reached? If they are dropped, is there a busy status reply from
> the /tika API?
>
> 3. Is the queue size or the number of concurrently parsed files exposed
> through an API?
>
>> To turn it around, are there specific file types that you are noticing
>> are causing OOM?
>
> I will have to look into obtaining analytics on the input, maybe that
> will shed more light.
>
> Thanks!
>
>> On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am seeing some cases with Tika 2.2.1 where despite setting a watchdog
>>> to limit the heap to 3GB, the entire Tika container exceeds 6GB and that
>>> exceeds the resource memory limit, so it gets OOM-ed. Here is one example:
>>>
>>> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB, shmem-rss:32kB,
>>> UID:0 pgtables:700kB oom_score_adj:-997
>>>
>>> Only some files seem to be causing this behavior.
>>>
>>> The memory ramps up fairly quickly, in a few tens of seconds it can go
>>> from 1GB to 6GB.
>>>
>>> The next step is to check if this goes away with 2.8.0, but I wonder if
>>> any of the following explanations make any sense:
>>> 1. The JVM is slow to observe the forked process exceeding its heap and
>>> does not terminate it in time
>>> 2. It's not the heap that grows, but there is some stack overflow due to
>>> very deep recursion.
>>>
>>> Finally, are there any file types that are known to use a lot of memory
>>> with Tika?
>>>
>>> Thanks,
>>> Cristi
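P.S. Until there is a server-side option, I am considering simply capping
the number of in-flight /tika requests on the client side. A rough sketch
of what I have in mind (just an illustration, not anything from the Tika
docs - the limit of 4, the URL, and the class name are placeholders):

// Rough client-side sketch (not a Tika API): bound the number of in-flight
// requests to one tika-server replica with a Semaphore. MAX_IN_FLIGHT, the
// URL, and the class name are placeholders for illustration.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.util.concurrent.Semaphore;

public class BoundedTikaClient {
    private static final int MAX_IN_FLIGHT = 4; // tune against the server's heap

    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI tikaUri = URI.create("http://tika:9998/tika"); // placeholder service URL

    public String extractText(Path file) throws Exception {
        permits.acquire(); // callers block here instead of piling onto the server
        try {
            HttpRequest req = HttpRequest.newBuilder(tikaUri)
                    .header("Accept", "text/plain")
                    .PUT(HttpRequest.BodyPublishers.ofFile(file))
                    .build();
            return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
        } finally {
            permits.release();
        }
    }
}

The idea is that each replica would never see more than a handful of
concurrent parses, so memory use per replica stays bounded regardless of
how the HPA scales.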
