Hi Emmanuel,

Thanks for your speedy reply. I have created an issue as you requested:
https://igniterealtime.atlassian.net/browse/OF-2367

In your email, you're writing that the exception seems to be due to some
TLS handshake. How did you come to that conclusion? You might very well be
right, but I don't see how I could have given you any data to support this
hypothesis? :)

You're also writing:

I suspect that TLS 1.3 is being used on the client side, which is not
> currently supported properly by MINA (just an hypothesis).


Is TLS 1.3 support in MINA definitely unsupported, or is that a hypothesis?
If TLS 1.3 support is lacking, then that is of concern for me. That might
be something to spin off to a different thread though, to not muddy the
waters too much.

my bad, if we get an exception during the runTasks execution, we won't
> exit the for(;;) loop, so this is plain normal to have an
> idleWorkers.incrementAndGet(); in the finally part.


Are you sure? I believe that you were originally correct. I've written a
test that throws an Exception during runTasks, and then compares the
workers size with the idleWorkers count. In my test, the idleWorkers is
_higher_ than the worker size (which I think is indicative of a bug). An
important note is that my test generates an Exception in a way that is
different from how the Executor would normally be used. I think I've proven
that _when_ an Exception occurs, things go wrong, but that I have not yet
found _how_ that Exception can occur.

While writing this email, I received what I believe to be conclusive
evidence of the above. I've asked a user that suffers from the problem to
run with a slightly modified version of MINA. The modification logs (and
rethrows) any Throwable in Worker's run(). This caused an Error to be
logged! The Error was thrown by an IoHandler implementation. Full details
are available at
https://igniterealtime.atlassian.net/browse/OF-2367?focusedCommentId=22738

I will modify the IoHandler implementation at our end to prevent this.
None-the-less, I believe that MINA should guard against this. I suspect
that removing the 'finally' block that is inside the for (;;) block of
Worker, and making execution of idleWorkers.incrementAndGet(); happen only
when no Throwable is thrown, is a huge improvement. In other words, replace:

    try {
        if (session != null) {
            runTasks(getSessionTasksQueue(session));
        }
    } finally {
        idleWorkers.incrementAndGet();
    }

with this:

    if (session != null) {
        runTasks(getSessionTasksQueue(session));
    }
    idleWorkers.incrementAndGet();

Wez see that we increment the idling workers counter before adding the
> worker to the workers set.


I suspect that you are theoretically correct here. Given what I wrote
earlier, I don't think it is the current cause of the issue at hand. It
might be good to harden the code like this though.

Kind regards,

  Guus

On Wed, 12 Jan 2022 at 04:49, Emmanuel Lécharny <elecha...@gmail.com> wrote:

> Hi again,
>
> my bad, if we get an exception during the runTasks execution, we won't
> exit the for(;;) loop, so this is plain normal to have an
> idleWorkers.incrementAndGet(); in the finally part.
>
> There are two places where we exit the loop:
> - if we exceed the number of workers
> - when session == EXIT_SIGNAL
>
> and in those cases, the idleWorkers counter has been dully decremented,
> so we should be good to go.
>
>
> If we now have a look at the addWorker() method:
>
>      private void addWorker() {
>          synchronized (workers) {
>              if (workers.size() >= super.getMaximumPoolSize()) {
>                  return;
>              }
>
>              // Create a new worker, and add it to the thread pool
>              Worker worker = new Worker();
>              Thread thread = getThreadFactory().newThread(worker);
>
>              // As we have added a new thread, it's considered as idle.
>              idleWorkers.incrementAndGet();
>
>              // Now, we can start it.
>              thread.start();
>              workers.add(worker);
>
>
> Wez see that we increment the idling workers counter before adding the
> worker to the workers set. It might explain why we see 2 idling workers
> when the set has no worker.
>
> May be the sequence should be :
>
>              workers.add(worker);
>
>              // As we have added a new thread, it's considered as idle.
>              idleWorkers.incrementAndGet();
>
>              // Now, we can start it.
>              thread.start();
>
> That should be safe because we are in a synchronized section on workers.
>
>
> On 12/01/2022 04:34, Emmanuel Lécharny wrote:
> > Hi Guus,
> >
> > I agree there is some inconsistency, and I need to check why.
> >
> > Regarding the initial pb, it seems like an exception causes the worker
> > to be removed, and this exception seems to be due to some TLS handshake
> > failure.
> >
> > I suspect that TLS 1.3 is being used on the client side, which is not
> > currently supported properly by MINA (just an hypothesis).
> >
> > Regardless, there is something puzzling:
> >
> > private class Worker implements Runnable {
> >
> >      public void run() {
> >
> >          try {
> >              for (;;) {
> >                  IoSession session = fetchSession();
> >
> >                  idleWorkers.decrementAndGet();
> >
> >                  if (session == null) {
> >                      synchronized (workers) {
> >                          if (workers.size() > getCorePoolSize()) {
> >                              // Remove now to prevent duplicate exit.
> >                              workers.remove(this);   <<<-------------
> >                              break;
> >                          }
> >              ...
> >
> >
> >              } finally {
> >                  synchronized (workers) {
> >                      workers.remove(this); <<------------------------
> >
> >
> > In the case where we reach the max pool size, we try to remove it twice.
> > Ok, it's not really critical, because the first call will succeed, and
> > the second will do nothing.
> >
> > And yes, if there is an exception during the runTasks, we will have an
> > increment of the idleWorkers despite the fact the worker will be removed.
> >
> >
> > Bottom line, there is a need to get this fixed.
> >
> > Can you open a JIRA for tracability?
> >
> >
> > On 11/01/2022 17:26, Guus der Kinderen wrote:
> >> Although I'm not sure if this is the source of my issue, I can
> >> reproduce the state by creating a unit test that causes a
> >> RuntimeException and/or Error to be thrown by a filter that is
> >> processing an event.
> >>
> >> In OrderedThreadPoolExecutor's Worker, this snippet causes the Worker
> >> to be removed, but the idleWorker count to be increased in such a
> >> scenario. Should the first try/finally block be removed perhaps
> >> (causing idleWorkers to be increased only if runTasks didn't throw
> >> anything?
> >>
> >>          try {
> >>              if (session !=null) {
> >>                  runTasks(getSessionTasksQueue(session));
> >>              }
> >>          }finally {
> >>              idleWorkers.incrementAndGet();
> >>          }
> >>      }
> >> }finally {
> >>      synchronized (workers) {
> >>          workers.remove(this);
> >>          OrderedThreadPoolExecutor.this.completedTaskCount
> >> +=completedTaskCount.get();
> >>          workers.notifyAll();
> >>      }
> >> }
> >>
> >>
> >> On Tue, 11 Jan 2022 at 15:41, Guus der Kinderen
> >> <guus.der.kinde...@gmail.com <mailto:guus.der.kinde...@gmail.com>>
> wrote:
> >>
> >>     Hello,
> >>
> >>     This was found with MINA 2.1.3. Is this a bug? I stumbled upon this
> >>     while performing analysis of heap dumps of a JVMs that suffered from
> >>     an issue where no user was able to establish a connection to our
> >>     server software anymore.
> >>
> >>     Our application uses an ExecutorFilter. The
> >>     OrderedThreadPoolExecutor of the affected server seems to be in an
> >>     inconsistent state:
> >>
> >>       * idleWorkers: 2
> >>       * waitingSessions.count: 2293
> >>       * workers.map.size: 0
> >>
> >>     I will attempt to include a screenshot of the profiler that holds
> >>     more data (if the mailinglist allows me).
> >>
> >>     What strikes me as odd is:
> >>
> >>       * No workers, while there are sessions waiting to be processed
> >>       * No workers, but a non-zero idle workers count
> >>
> >>     Servers that are unaffected by the issue have an idle worker count
> >>     that is equal to the amount of workers (I assume that the snapshots
> >>     were taken when that server was not actively processing data).
> >>
> >>     For reference: the issue that we are experiencing (possibly caused
> >>     by this) is tracked in
> >>     https://igniterealtime.atlassian.net/browse/OF-2367
> >>     <https://igniterealtime.atlassian.net/browse/OF-2367>. Multiple
> >>     people are reporting the issue, but I've been unable to reproduce it
> >>     locally.
> >>
> >>     Kind regards,
> >>
> >>        Guus
> >>
> >>     image.png
> >>
> >
>
> --
> *Emmanuel Lécharny - CTO* 205 Promenade des Anglais – 06200 NICE
> T. +33 (0)4 89 97 36 50
> P. +33 (0)6 08 33 32 61
> emmanuel.lecha...@busit.com https://www.busit.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@mina.apache.org
> For additional commands, e-mail: users-h...@mina.apache.org
>
>

Reply via email to