Hi Tim and Nicholas,

Thanks for your answers.

On Wed, Jun 23, 2021 at 4:17 PM Nicholas DiPiazza <
[email protected]> wrote:

>  > - how many connections will it accept before not accepting new
> connections?
>
> You will not hit the jetty max request limits. Rather you will hit CPU
> saturation or out-of-memory conditions that will happen far before that.
>

Indeed, I am sure that in most circumstances I will hit either CPU
saturation or OOM first, but the question is at which point a busy server
stops accepting new connections.


>
> > how many files can be scanned in parallel?
>
> Totally depends on the files you are parsing. Empirical analysis is the
> only way to tell.
>

I see. I was thinking that maybe there is a clear limit on the number of
connections (i.e., the jetty max request limits you mentioned above). In
that case it would make sense to return status code 429, for example.


>
> > what is the return code to expect when there is contention on the server?
>
> You will get timeouts when there is too much contention, and you'll see
> the spawned tika servers keep restarting if you are making them OOM with
> too many files. CPU contention will show up as sluggishness and failure
> to respond. No error codes.
>

Understood.


>
> > - is it a safe assumption that for connections to be dropped, CPU will
> be saturated?
>
> 1) cpu saturation, 2) OOM, 3) infinite loops due to parser bug can happen.
>
> The naive solution is to just set your timeout to something reasonable,
> like 30-60 seconds, then retry documents.
>
> I went through all of this for many years, then recently, last year, I
> changed to a new, much more successful strategy:
>
> Kubernetes/Docker - have many tika-server instances deployed in 2CPU/4G
> kubernetes pods
>
> Then add these server urls to a resource pool, where each thread that
> needs to parse checks out a server, then checks it back in.
>
> By having each thread have its very own tika server, it prevents issues
> where threadA throws in an Excel document that causes an OOM error and
> blows up all the active parses for threadB, C, D, E, F, etc.
>
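
To make sure I follow the check-out/check-in idea, here is roughly what I
picture (a minimal sketch with hypothetical names, just a blocking queue of
base URLs, not anything from the Tika codebase):

import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical pool of tika-server base URLs; each worker thread checks one
// out, sends its document to that server only, then checks it back in.
public class TikaServerPool {
    private final BlockingQueue<String> servers = new LinkedBlockingQueue<>();

    public TikaServerPool(Collection<String> baseUrls) {
        servers.addAll(baseUrls);
    }

    // Blocks until a server is free, so there is at most one in-flight
    // parse per server and a crash only affects the checked-out instance.
    public String checkOut() throws InterruptedException {
        return servers.take();
    }

    public void checkIn(String baseUrl) {
        servers.add(baseUrl);
    }
}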

Actually, I am doing something similar with GKE horizontal auto-scaling.
However, having a tika-server per document sounds expensive: at the very
least the reserved memory won't be used very well, and most likely the CPU
won't either --- so there are still multiple docs being scanned by each
tika-server. IMO it's fine to retry all the active parses in case one blows
up, since errors are not very frequent.
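
Concretely, the retry path I have in mind looks something like the sketch
below (not production code; it assumes tika-server on the default port 9998
and the plain PUT /tika text-extraction endpoint, with the 30-60 second
timeout Nicholas suggested):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

public class RetryingTikaClient {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    // PUT the file to /tika; retry on timeouts, dropped connections, or
    // non-200 responses (e.g. while the server restarts after an OOM).
    static String parseWithRetry(Path file, int maxAttempts) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/tika"))  // assumed local tika-server
                .timeout(Duration.ofSeconds(60))                // 30-60s per-request timeout
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HttpResponse<String> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return response.body();
                }
                lastFailure = new IOException("tika-server returned " + response.statusCode());
            } catch (IOException e) {
                lastFailure = e;                    // timeout or dropped connection
            }
            Thread.sleep(1000L * attempt);          // simple linear backoff before retrying
        }
        throw lastFailure;
    }
}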


>
> Then we created Tika Pipes in Tika 2.0.0 to do this in a more graceful
> way: you create a Fetch/Emit pipeline, push the files you want to parse
> into a queue, and they are parsed asynchronously, with the parsed output
> emitted once complete.
>

Sounds good, and I am using Tika 2.0.0 - is this behavior enabled by
default in the corresponding Tika server?


>
>
> When I unbury from a bunch of unrelated work, I hope to have a YouTube
> video and corresponding wiki article that show how Tika Pipes works. It's
> likely exactly what you are looking for. I'll make sure to send you a link
> to that when done.
>

Thanks for that, but at the moment I intend to stick with the Tika server,
so I hope this works out of the box there.


>
>
> On Wed, Jun 23, 2021 at 6:47 AM Tim Allison <[email protected]> wrote:
>
>>
>> Sorry… Sergey Beryozkin
>>
>> On Wed, Jun 23, 2021 at 6:46 AM Tim Allison <[email protected]> wrote:
>>
>>> Hi Cristi,
>>>
>>>    I regret that I don't have precise answers for these questions.
>>> tika-server uses Apache cxf and most of your questions are handled at
>>> that level.  There is no logic in Tika for number of connections,
>>> identifying contention or even keeping track of the number of parallel
>>> requests.
>>>
>>>    If you're running in --spawnChild mode in 1.x or running in default
>>> in 2.x, the server can go down and drop connections if a file has
>>> caused a catastrophic problem (timeout, oom or other crash), but that
>>> doesn't necessarily mean that CPU will be saturated.
>>>
>>>    In practice, I've found that it is better to run multiple
>>> tika-servers (on different ports?) and have one tika-server per client
>>> so that you effectively avoid multithreading...this also enables you
>>> to know which file caused a catastrophic problem.  If you're running
>>> multiple requests on a single server, and one of the files causes a
>>> shutdown/restart, you won't know which of the active files caused the
>>> problem.
>>>
>>>    Nicholas DiPiazza has experience with pegging tika-servers.  He
>>> might be willing to chime in?
>>>
>>>    Sergey Beryokin is our cxf expert...he might have better insight on
>>> the cxf layer.
>>>
>>>    The above input applies to the standard /tika, /rmeta endpoints.
>>> The new pipes /pipes and /async handlers fork multiple sub-processes
>>> and do the parsing there.  I have not yet experimented with
>>> overwhelming them in practice/production, but the /async handler at
>>> least has a return value for "queue is full, please don't send any
>>> more requests".
>>>
>>
Tim, this is exactly what I was hoping would be implemented in the Tika
server as well. If the "queue is full" signal could translate into a 429
return code for the client, that would be great.
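
In case that ever lands, the client side would be straightforward; here is
a hedged sketch of what I mean (the 429 code is the part I am asking about,
not something the current /async endpoint is documented to return):

import java.io.IOException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueueAwareSubmitter {

    // Hypothetical handling: treat 429 as "queue full, back off and resubmit"
    // instead of treating the document as failed.
    static boolean submitWithBackoff(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < 5; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 429) {
                Thread.sleep((1L << attempt) * 500);  // exponential backoff: 0.5s, 1s, 2s ...
                continue;
            }
            return response.statusCode() == 200;
        }
        return false;  // still full; re-queue the document and try again later
    }
}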

Thanks!

Cristi



>
>>>      Best,
>>>
>>>           Tim
>>>
>>> On Tue, Jun 22, 2021 at 3:28 AM Cristian Zamfir <[email protected]>
>>> wrote:
>>> >
>>> > Hello, please let me know if somebody has looked into this or I should
>>> look at the source code instead? Thanks!
>>> >
>>> > On Fri, Jun 18, 2021 at 5:04 PM Cristian Zamfir <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I have a few questions about the concurrency level of tika-server in
>>> the default configuration:
>>> >> - how many connections will it accept before not accepting new
>>> connections?
>>> >> - how many files can be scanned in parallel?
>>> >> - what is the return code to expect when there is contention on the
>>> server?
>>> >> - is it a safe assumption that for connections to be dropped, CPU
>>> will be saturated?
>>> >>
>>> >> Thanks,
>>> >> Cristi
>>> >>
>>>
>>
