Hi Cristi,
I regret that I don't have precise answers for these questions.
tika-server uses Apache CXF, and most of your questions are handled at
that level. Tika itself has no logic for limiting the number of
connections, detecting contention, or even keeping track of the number
of parallel requests.
If you're running in --spawnChild mode in 1.x or in the default mode
in 2.x, the server can go down and drop connections if a file has
caused a catastrophic problem (timeout, OOM or other crash), but that
doesn't necessarily mean that the CPU is saturated.
In practice, I've found that it is better to run multiple
tika-servers (on different ports?) and have one tika-server per client
so that you effectively avoid multithreading. This also lets you know
which file caused a catastrophic problem. If you're running multiple
requests on a single server and one of the files causes a
shutdown/restart, you won't know which of the active files caused the
problem.
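A minimal sketch of that one-server-per-client setup, in case it helps.
It assumes you have already started one tika-server per worker on
consecutive ports starting at 9998 on localhost; the helper names
(server_url, put_file, run_worker, run_all) are made up for this
example, and the error handling is only a starting point.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_PORT = 9998  # assumption: server i listens on BASE_PORT + i


def server_url(worker_id, base_port=BASE_PORT):
    """Each worker talks to exactly one server, so requests never overlap."""
    return f"http://localhost:{base_port + worker_id}/tika"


def put_file(url, path, timeout=300):
    """PUT one file to the /tika endpoint and return the extracted text."""
    with open(path, "rb") as fh:
        req = urllib.request.Request(url, data=fh.read(), method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_worker(worker_id, paths, send=put_file):
    """Send files one at a time; a failure points at the file in flight."""
    url = server_url(worker_id)
    results, failures = {}, []
    for path in paths:
        try:
            results[path] = send(url, path)
        except OSError as exc:  # covers URLError, timeouts, dropped connections
            failures.append((path, repr(exc)))
    return results, failures


def run_all(batches, send=put_file):
    """Run one worker thread per batch, each bound to its own server."""
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = [pool.submit(run_worker, i, batch, send)
                   for i, batch in enumerate(batches)]
        return [f.result() for f in futures]
```

Because each server only ever has one request in flight, any file in a
worker's failure list is the only candidate for that failure. Note that
files sent right after a crash may also fail while that server is
restarting, so you may want a short wait or retry there.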
Nicholas DiPiazza has experience with pegging tika-servers. He
might be willing to chime in?
Sergey Beryozkin is our CXF expert. He might have better insight on
the CXF layer.
The above applies to the standard /tika and /rmeta endpoints.
The new /pipes and /async handlers fork multiple sub-processes
and do the parsing there. I have not yet tried overwhelming them
in practice/production, but the /async handler at least has a
return value for "queue is full, please don't send any more
requests".
Best,
Tim
On Tue, Jun 22, 2021 at 3:28 AM Cristian Zamfir <[email protected]> wrote:
>
> Hello, please let me know if somebody has looked into this or I should look
> at the source code instead? Thanks!
>
> On Fri, Jun 18, 2021 at 5:04 PM Cristian Zamfir <[email protected]> wrote:
>>
>> Hi,
>>
>> I have a few questions about the concurrency level of tika-server in the
>> default configuration:
>> - how many connections will it accept before not accepting new connections?
>> - how many files can be scanned in parallel?
>> - what is the return code to expect when there is contention on the server?
>> - is it a safe assumption that for connections to be dropped, CPU will be
>> saturated?
>>
>> Thanks,
>> Cristi
>>