On 12/01/2022 13:47, Guus der Kinderen wrote:
Hi Emmanuel,
Thanks for your speedy reply. I have created an issue as you requested:
https://igniterealtime.atlassian.net/browse/OF-2367
<https://igniterealtime.atlassian.net/browse/OF-2367>
In your email, you're writing that the exception seems to be due to some
TLS handshake. How did you come to that conclusion? You might very well
be right, but I don't see how I could have given you any data to support
this hypothesis? :)
Well, you mentionned logs like:
javax.net.ssl.SSLHandshakeException: SSL handshake failed.
and I (quickly) jumped to conclusion that TLS might be a cause.
But it was 4 AM, and I agree it's pulled out of thin air.
A question though: is the SSLFilter before or after the ExecutorFilter
in your chain ? (I would assume before).
If it's before, then the HS failure would have no impact.
You're also writing:
I suspect that TLS 1.3 is being used on the client side, which is not
currently supported properly by MINA (just an hypothesis).
Is TLS 1.3 support in MINA definitely unsupported, or is that a
hypothesis? If TLS 1.3 support is lacking, then that is of concern for
me. That might be something to spin off to a different thread though, to
not muddy the waters too much.
We definitively have some issue with TLS 1.3, and this is being
investigated::
https://issues.apache.org/jira/browse/DIRMINA-1132
my bad, if we get an exception during the runTasks execution, we won't
exit the for(;;) loop, so this is plain normal to have an
idleWorkers.incrementAndGet(); in the finally part.
Are you sure? I believe that you were originally correct.
Well, I kow do think that the exception not being caught will be
propagated to the next catch, ie the one that remove the worker from the
pool.
In any case, the finally block should increase the number of idle
worker, and that is not going to be correct because the idleWorker count
is now going to be incremented while the worker has been removed.
I'm not sure the finally block is useful at all at this point.
Anyway, I just committed the change.
I've written a
test that throws an Exception during runTasks, and then compares the
workers size with the idleWorkers count. In my test, the idleWorkers is
_higher_ than the worker size (which I think is indicative of a bug). An
important note is that my test generates an Exception in a way that is
different from how the Executor would normally be used. I think I've
proven that _when_ an Exception occurs, things go wrong, but that I have
not yet found _how_ that Exception can occur.
While writing this email, I received what I believe to be conclusive
evidence of the above. I've asked a user that suffers from the problem
to run with a slightly modified version of MINA. The modification logs
(and rethrows) any Throwable in Worker's run(). This caused an Error to
be logged! The Error was thrown by an IoHandler implementation. Full
details are available at
https://igniterealtime.atlassian.net/browse/OF-2367?focusedCommentId=22738
<https://igniterealtime.atlassian.net/browse/OF-2367?focusedCommentId=22738>
I will modify the IoHandler implementation at our end to prevent this.
None-the-less, I believe that MINA should guard against this. I suspect
that removing the 'finally' block that is inside the for (;;) block of
Worker, and making execution of idleWorkers.incrementAndGet(); happen
only when no Throwable is thrown, is a huge improvement. In other words,
replace:
try {
if (session !=null) {
runTasks(getSessionTasksQueue(session));
}
}finally {
idleWorkers.incrementAndGet();
}
with this:
if (session !=null) {
runTasks(getSessionTasksQueue(session));
}
idleWorkers.incrementAndGet();
Wez see that we increment the idling workers counter before adding the
worker to the workers set.
I suspect that you are theoretically correct here. Given what I wrote
earlier, I don't think it is the current cause of the issue at hand. It
might be good to harden the code like this though.
Kind regards,
Guus
On Wed, 12 Jan 2022 at 04:49, Emmanuel Lécharny <elecha...@gmail.com
<mailto:elecha...@gmail.com>> wrote:
Hi again,
my bad, if we get an exception during the runTasks execution, we won't
exit the for(;;) loop, so this is plain normal to have an
idleWorkers.incrementAndGet(); in the finally part.
There are two places where we exit the loop:
- if we exceed the number of workers
- when session == EXIT_SIGNAL
and in those cases, the idleWorkers counter has been dully decremented,
so we should be good to go.
If we now have a look at the addWorker() method:
private void addWorker() {
synchronized (workers) {
if (workers.size() >= super.getMaximumPoolSize()) {
return;
}
// Create a new worker, and add it to the thread pool
Worker worker = new Worker();
Thread thread = getThreadFactory().newThread(worker);
// As we have added a new thread, it's considered as idle.
idleWorkers.incrementAndGet();
// Now, we can start it.
thread.start();
workers.add(worker);
Wez see that we increment the idling workers counter before adding the
worker to the workers set. It might explain why we see 2 idling workers
when the set has no worker.
May be the sequence should be :
workers.add(worker);
// As we have added a new thread, it's considered as idle.
idleWorkers.incrementAndGet();
// Now, we can start it.
thread.start();
That should be safe because we are in a synchronized section on workers.
On 12/01/2022 04:34, Emmanuel Lécharny wrote:
> Hi Guus,
>
> I agree there is some inconsistency, and I need to check why.
>
> Regarding the initial pb, it seems like an exception causes the
worker
> to be removed, and this exception seems to be due to some TLS
handshake
> failure.
>
> I suspect that TLS 1.3 is being used on the client side, which is
not
> currently supported properly by MINA (just an hypothesis).
>
> Regardless, there is something puzzling:
>
> private class Worker implements Runnable {
>
> public void run() {
>
> try {
> for (;;) {
> IoSession session = fetchSession();
>
> idleWorkers.decrementAndGet();
>
> if (session == null) {
> synchronized (workers) {
> if (workers.size() > getCorePoolSize()) {
> // Remove now to prevent duplicate exit.
> workers.remove(this); <<<-------------
> break;
> }
> ...
>
>
> } finally {
> synchronized (workers) {
> workers.remove(this); <<------------------------
>
>
> In the case where we reach the max pool size, we try to remove it
twice.
> Ok, it's not really critical, because the first call will
succeed, and
> the second will do nothing.
>
> And yes, if there is an exception during the runTasks, we will
have an
> increment of the idleWorkers despite the fact the worker will be
removed.
>
>
> Bottom line, there is a need to get this fixed.
>
> Can you open a JIRA for tracability?
>
>
> On 11/01/2022 17:26, Guus der Kinderen wrote:
>> Although I'm not sure if this is the source of my issue, I can
>> reproduce the state by creating a unit test that causes a
>> RuntimeException and/or Error to be thrown by a filter that is
>> processing an event.
>>
>> In OrderedThreadPoolExecutor's Worker, this snippet causes the
Worker
>> to be removed, but the idleWorker count to be increased in such a
>> scenario. Should the first try/finally block be removed perhaps
>> (causing idleWorkers to be increased only if runTasks didn't throw
>> anything?
>>
>> try {
>> if (session !=null) {
>> runTasks(getSessionTasksQueue(session));
>> }
>> }finally {
>> idleWorkers.incrementAndGet();
>> }
>> }
>> }finally {
>> synchronized (workers) {
>> workers.remove(this);
>> OrderedThreadPoolExecutor.this.completedTaskCount
>> +=completedTaskCount.get();
>> workers.notifyAll();
>> }
>> }
>>
>>
>> On Tue, 11 Jan 2022 at 15:41, Guus der Kinderen
>> <guus.der.kinde...@gmail.com
<mailto:guus.der.kinde...@gmail.com>
<mailto:guus.der.kinde...@gmail.com
<mailto:guus.der.kinde...@gmail.com>>> wrote:
>>
>> Hello,
>>
>> This was found with MINA 2.1.3. Is this a bug? I stumbled
upon this
>> while performing analysis of heap dumps of a JVMs that
suffered from
>> an issue where no user was able to establish a connection to our
>> server software anymore.
>>
>> Our application uses an ExecutorFilter. The
>> OrderedThreadPoolExecutor of the affected server seems to be
in an
>> inconsistent state:
>>
>> * idleWorkers: 2
>> * waitingSessions.count: 2293
>> * workers.map.size: 0
>>
>> I will attempt to include a screenshot of the profiler that
holds
>> more data (if the mailinglist allows me).
>>
>> What strikes me as odd is:
>>
>> * No workers, while there are sessions waiting to be processed
>> * No workers, but a non-zero idle workers count
>>
>> Servers that are unaffected by the issue have an idle worker
count
>> that is equal to the amount of workers (I assume that the
snapshots
>> were taken when that server was not actively processing data).
>>
>> For reference: the issue that we are experiencing (possibly
caused
>> by this) is tracked in
>> https://igniterealtime.atlassian.net/browse/OF-2367
<https://igniterealtime.atlassian.net/browse/OF-2367>
>> <https://igniterealtime.atlassian.net/browse/OF-2367
<https://igniterealtime.atlassian.net/browse/OF-2367>>. Multiple
>> people are reporting the issue, but I've been unable to
reproduce it
>> locally.
>>
>> Kind regards,
>>
>> Guus
>>
>> image.png
>>
>
--
*Emmanuel Lécharny - CTO* 205 Promenade des Anglais – 06200 NICE
T. +33 (0)4 89 97 36 50
P. +33 (0)6 08 33 32 61
emmanuel.lecha...@busit.com <mailto:emmanuel.lecha...@busit.com>
https://www.busit.com/ <https://www.busit.com/>
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@mina.apache.org
<mailto:users-unsubscr...@mina.apache.org>
For additional commands, e-mail: users-h...@mina.apache.org
<mailto:users-h...@mina.apache.org>
--
*Emmanuel Lécharny - CTO* 205 Promenade des Anglais – 06200 NICE
T. +33 (0)4 89 97 36 50
P. +33 (0)6 08 33 32 61
emmanuel.lecha...@busit.com https://www.busit.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@mina.apache.org
For additional commands, e-mail: users-h...@mina.apache.org