Hi Tim and Nicholas,

Thanks for your answers.
On Wed, Jun 23, 2021 at 4:17 PM Nicholas DiPiazza <[email protected]> wrote:

> > - how many connections will it accept before not accepting new connections?
>
> You will not hit the Jetty max request limits. Rather, you will hit CPU
> saturation or out-of-memory conditions far before that.

Indeed, I am sure that in most circumstances I will hit either CPU saturation or OOM, but the question is at which point a busy server stops accepting new connections.

> > - how many files can be scanned in parallel?
>
> Totally depends on the files you are parsing. Empirical analysis is the
> only way to tell.

I see. I was thinking that maybe there is a clear limit on the number of connections (i.e., the Jetty max request limits you mentioned above). In that case it would make sense to return status code 429, for example.

> > - what is the return code to expect when there is contention on the server?
>
> You will get timeouts when there is too much contention, and you'll see
> the spawned tika-server processes keep restarting if you are making them
> OOM with too many files. CPU contention will show up as sluggishness and
> failure to respond. No error codes.

Understood.

> > - is it a safe assumption that for connections to be dropped, CPU will be saturated?
>
> 1) CPU saturation, 2) OOM, or 3) infinite loops due to parser bugs can all happen.
>
> The naive solution is to set your timeout to something reasonable, like
> 30-60 seconds, then retry documents.
>
> I went through all of this for many years; then, last year, I changed to
> a new, much more successful strategy:
>
> Kubernetes/Docker - have many tika-server instances deployed in
> 2 CPU / 4 GB Kubernetes pods.
>
> Then add these server URLs to a resource pool, where each thread that
> needs to parse checks out a server, then checks it back in.
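As an aside, the check-out/check-in pattern described above can be sketched with a simple blocking pool. This is purely illustrative (the class name and server URLs are made up, not Tika APIs):

```python
import queue


class TikaServerPool:
    """Thread-safe pool of tika-server base URLs (illustrative sketch)."""

    def __init__(self, urls):
        self._pool = queue.Queue()
        for url in urls:
            self._pool.put(url)

    def check_out(self, timeout=None):
        # Blocks until a server is free, so each thread gets exclusive
        # use of one tika-server instance at a time.
        return self._pool.get(timeout=timeout)

    def check_in(self, url):
        # Return the server to the pool once the parse is done.
        self._pool.put(url)


pool = TikaServerPool(["http://tika-0:9998", "http://tika-1:9998"])
url = pool.check_out()
try:
    pass  # parse one document against `url` here
finally:
    pool.check_in(url)  # always return the server, even on failure
```

Because check_out blocks when every server is busy, back-pressure falls on the calling threads instead of piling requests onto any single tika-server.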
> By having each thread have its very own tika-server, it prevents issues
> where thread A threw in an Excel document that caused an OOM error and
> then blew up all the active parses for threads B, C, D, E, F, etc.

Actually, I am doing something similar with GKE horizontal auto-scaling. However, having a tika-server per document sounds expensive: at the very least the reserved memory won't be used very well, but most likely also the CPU --- so there are still multiple docs being scanned by each tika-server. IMO it's fine to retry all the active parses in case one blows up, since errors are not very frequent.

> Then we created Tika Pipes in Tika 2.0.0 to do this in a more graceful
> way: you create a Fetch/Emit pipeline, push the files that you want to
> parse into a queue, and they are asynchronously parsed, with the parsed
> output emitted after completion.

Sounds good, and I am using Tika 2.0.0 - is this behavior used by default in the corresponding Tika server?

> When I unbury from a bunch of unrelated work, I hope to make a YouTube
> video and a corresponding wiki article that show how Tika Pipes works.
> It's likely exactly what you are looking for. I'll make sure to send you
> a link when it's done.

Thanks for that, but at the moment I intend to stick to the Tika server, so I hope this is used out of the box by Tika server.

On Wed, Jun 23, 2021 at 6:47 AM Tim Allison <[email protected]> wrote:

>> Sorry… Sergey Beryozkin

On Wed, Jun 23, 2021 at 6:46 AM Tim Allison <[email protected]> wrote:

>>> Hi Cristi,
>>>
>>> I regret that I don't have precise answers for these questions.
>>> tika-server uses Apache CXF, and most of your questions are handled at
>>> that level. There is no logic in Tika for the number of connections,
>>> for identifying contention, or even for keeping track of the number of
>>> parallel requests.
>>> If you're running in --spawnChild mode in 1.x, or running the default
>>> in 2.x, the server can go down and drop connections if a file has
>>> caused a catastrophic problem (timeout, OOM, or another crash), but
>>> that doesn't necessarily mean that CPU will be saturated.
>>>
>>> In practice, I've found that it is better to run multiple
>>> tika-servers (on different ports) and have one tika-server per client
>>> so that you effectively avoid multithreading... this also enables you
>>> to know which file caused a catastrophic problem. If you're running
>>> multiple requests on a single server, and one of the files causes a
>>> shutdown/restart, you won't know which of the active files caused the
>>> problem.
>>>
>>> Nicholas DiPiazza has experience with pegging tika-servers. He
>>> might be willing to chime in?
>>>
>>> Sergey Beryozkin is our CXF expert... he might have better insight on
>>> the CXF layer.
>>>
>>> The above input applies to the standard /tika and /rmeta endpoints.
>>> The new /pipes and /async handlers fork multiple sub-processes and do
>>> the parsing there. I have not yet experimented with overwhelming them
>>> in practice/production, but the /async handler at least has a return
>>> value for "queue is full, please don't send any more requests".

Tim, this is exactly what I was hoping is implemented in the Tika server as well. If the "queue is full" could translate into a 429 return code for the client, that would be great.

Thanks!
Cristi

>>> Best,
>>>
>>> Tim
>>>
>>> On Tue, Jun 22, 2021 at 3:28 AM Cristian Zamfir <[email protected]> wrote:
>>> >
>>> > Hello, please let me know if somebody has looked into this, or should I
>>> > look at the source code instead? Thanks!
>>> > On Fri, Jun 18, 2021 at 5:04 PM Cristian Zamfir <[email protected]> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I have a few questions about the concurrency level of tika-server in
>>> >> the default configuration:
>>> >> - how many connections will it accept before not accepting new connections?
>>> >> - how many files can be scanned in parallel?
>>> >> - what is the return code to expect when there is contention on the server?
>>> >> - is it a safe assumption that for connections to be dropped, CPU will be saturated?
>>> >>
>>> >> Thanks,
>>> >> Cristi
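P.S. If the /async handler's "queue is full" signal were surfaced to clients as HTTP 429 (the mapping I'm hoping for above; this is an assumption, not confirmed Tika behavior), a client could treat it as back-pressure and back off rather than fail. A hypothetical sketch, with the endpoint URL and payload as placeholders:

```python
import time
import urllib.error
import urllib.request


def submit_with_backoff(url: str, payload: bytes, max_attempts: int = 5) -> bytes:
    """POST a request, treating HTTP 429 as 'queue full, try again later'."""
    delay = 1.0
    for _ in range(max_attempts):
        req = urllib.request.Request(url, data=payload, method="POST")
        req.add_header("Content-Type", "application/json")
        try:
            with urllib.request.urlopen(req, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise  # a real error, not back-pressure
            # Queue is full: wait with exponential backoff, then retry.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError(f"server still overloaded after {max_attempts} attempts")
```

The nice property of an explicit 429 over a plain timeout is that the client knows the server is healthy but saturated, so retrying the same document later is safe.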
