On 12/01/2022 13:47, Guus der Kinderen wrote:
Hi Emmanuel,

Thanks for your speedy reply. I have created an issue as you requested: https://igniterealtime.atlassian.net/browse/OF-2367

In your email, you're writing that the exception seems to be due to some TLS handshake. How did you come to that conclusion? You might very well be right, but I don't see how I could have given you any data to support this hypothesis? :)

Well, you mentioned logs like:

javax.net.ssl.SSLHandshakeException: SSL handshake failed.

and I (quickly) jumped to the conclusion that TLS might be a cause.

But it was 4 AM, and I agree it's pulled out of thin air.


A question though: is the SSLFilter before or after the ExecutorFilter in your chain? (I would assume before.)

If it's before, then the handshake failure would have no impact.


You're also writing:

    I suspect that TLS 1.3 is being used on the client side, which is not
    currently supported properly by MINA (just a hypothesis).


Is TLS 1.3 definitely unsupported in MINA, or is that a hypothesis? If TLS 1.3 support is lacking, that is of concern to me. That might be something to spin off into a different thread, though, so as not to muddy the waters too much.

We definitely have an issue with TLS 1.3, and it is being investigated:

https://issues.apache.org/jira/browse/DIRMINA-1132


    my bad, if we get an exception during the runTasks execution, we won't
    exit the for(;;) loop, so this is plain normal to have an
    idleWorkers.incrementAndGet(); in the finally part.


Are you sure? I believe that you were originally correct.

Well, I now do think that the uncaught exception will be propagated to the next catch, i.e. the one that removes the worker from the pool.

In any case, the finally block will still increment the idle worker count, and that is incorrect: the idleWorkers count gets incremented even though the worker has been removed.

I'm not sure the finally block is useful at all at this point.

Anyway, I just committed the change.

I've written a test that throws an Exception during runTasks, and then compares the workers size with the idleWorkers count. In my test, idleWorkers is _higher_ than the workers size (which I think is indicative of a bug). An important note is that my test generates the Exception in a way that differs from how the Executor would normally be used. I think I've proven that _when_ an Exception occurs, things go wrong, but I have not yet found _how_ that Exception can occur.

While writing this email, I received what I believe to be conclusive evidence of the above. I asked a user who suffers from the problem to run a slightly modified version of MINA. The modification logs (and rethrows) any Throwable in Worker's run(). This caused an Error to be logged! The Error was thrown by an IoHandler implementation. Full details are available at https://igniterealtime.atlassian.net/browse/OF-2367?focusedCommentId=22738
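For reference, the diagnostic change amounted to something like the following sketch. This is hypothetical illustration code, not the actual patch: the class name, the in-memory LOG list, and the message text are placeholders I made up; the point is only the catch-Throwable-log-rethrow shape.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a worker that records any Throwable — including Errors thrown
// by a handler — before letting it escape run(). Names are illustrative.
public class LoggingWorker implements Runnable {
    static final List<String> LOG = new ArrayList<>();
    private final Runnable body;

    public LoggingWorker(Runnable body) {
        this.body = body;
    }

    @Override
    public void run() {
        try {
            body.run();
        } catch (Throwable t) {
            LOG.add("Worker died: " + t); // record it...
            throw t;                      // ...then rethrow: behaviour unchanged
        }
    }

    public static void main(String[] args) {
        LoggingWorker w = new LoggingWorker(() -> {
            throw new AssertionError("thrown by a handler");
        });
        try {
            w.run();
        } catch (Throwable expected) {
            // the Error still propagates, but it has been recorded first
        }
        System.out.println(LOG);
    }
}
```

Note that the rethrow compiles because Runnable.run() declares no checked exceptions, so Java's precise-rethrow analysis permits `throw t;` from a `catch (Throwable t)`.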

I will modify the IoHandler implementation at our end to prevent this. Nonetheless, I believe that MINA should guard against this. I suspect that removing the 'finally' block inside the for (;;) loop of Worker, so that idleWorkers.incrementAndGet() only executes when no Throwable is thrown, would be a big improvement. In other words, replace:

try {
    if (session != null) {
        runTasks(getSessionTasksQueue(session));
    }
} finally {
    idleWorkers.incrementAndGet();
}

with this:

if (session != null) {
    runTasks(getSessionTasksQueue(session));
}
idleWorkers.incrementAndGet();
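To illustrate the accounting problem in isolation, here is a small self-contained sketch that simulates one pass of the worker loop under both variants. This is not the actual MINA code: the class, the single-worker setup, and runTasks() throwing unconditionally are simplifying assumptions; it only demonstrates why the inner finally leaves idleWorkers higher than the workers set size.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerAccounting {

    // Simulates one pass of the worker loop with a task that throws.
    // Returns idleWorkers - workers.size(); non-zero means inconsistency.
    static int simulate(boolean incrementInFinally) {
        AtomicInteger idleWorkers = new AtomicInteger(1); // worker starts idle
        Set<Object> workers = new HashSet<>();
        Object worker = new Object();
        workers.add(worker);

        try {
            idleWorkers.decrementAndGet(); // a session was fetched
            if (incrementInFinally) {
                try {
                    runTasks(); // throws
                } finally {
                    idleWorkers.incrementAndGet(); // still runs on a Throwable
                }
            } else {
                runTasks(); // throws, so the increment below is skipped
                idleWorkers.incrementAndGet();
            }
        } catch (RuntimeException propagated) {
            // in the real code this escapes the for (;;) loop
        } finally {
            workers.remove(worker); // outer finally of Worker.run()
        }
        return idleWorkers.get() - workers.size();
    }

    static void runTasks() {
        throw new RuntimeException("thrown by a task / handler");
    }

    public static void main(String[] args) {
        System.out.println("drift with finally:    " + simulate(true));
        System.out.println("drift without finally: " + simulate(false));
    }
}
```

With the inner finally the drift is 1 (an idle worker is counted that no longer exists); without it, the counter and the set stay consistent at 0.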

    We see that we increment the idling workers counter before adding the
    worker to the workers set.


I suspect that you are theoretically correct here. Given what I wrote earlier, I don't think it is the current cause of the issue at hand. It might be good to harden the code like this though.

Kind regards,

   Guus
On Wed, 12 Jan 2022 at 04:49, Emmanuel Lécharny <elecha...@gmail.com> wrote:

    Hi again,

    my bad, if we get an exception during the runTasks execution, we won't
    exit the for(;;) loop, so this is plain normal to have an
    idleWorkers.incrementAndGet(); in the finally part.

    There are two places where we exit the loop:
    - if we exceed the number of workers
    - when session == EXIT_SIGNAL

    and in those cases, the idleWorkers counter has been duly decremented,
    so we should be good to go.


    If we now have a look at the addWorker() method:

          private void addWorker() {
              synchronized (workers) {
                  if (workers.size() >= super.getMaximumPoolSize()) {
                      return;
                  }

                  // Create a new worker, and add it to the thread pool
                  Worker worker = new Worker();
                  Thread thread = getThreadFactory().newThread(worker);

                  // As we have added a new thread, it's considered as idle.
                  idleWorkers.incrementAndGet();

                  // Now, we can start it.
                  thread.start();
                  workers.add(worker);


    We see that we increment the idling workers counter before adding the
    worker to the workers set. It might explain why we see 2 idling workers
    when the set has no worker.

    Maybe the sequence should be:

                  workers.add(worker);

                  // As we have added a new thread, it's considered as idle.
                  idleWorkers.incrementAndGet();

                  // Now, we can start it.
                  thread.start();

    That should be safe because we are in a synchronized section on workers.


    On 12/01/2022 04:34, Emmanuel Lécharny wrote:
     > Hi Guus,
     >
     > I agree there is some inconsistency, and I need to check why.
     >
     > Regarding the initial problem, it seems like an exception causes
     > the worker to be removed, and this exception seems to be due to
     > some TLS handshake failure.
     >
     > I suspect that TLS 1.3 is being used on the client side, which is
     > not currently supported properly by MINA (just a hypothesis).
     >
     > Regardless, there is something puzzling:
     >
     > private class Worker implements Runnable {
     >
     >      public void run() {
     >
     >          try {
     >              for (;;) {
     >                  IoSession session = fetchSession();
     >
     >                  idleWorkers.decrementAndGet();
     >
     >                  if (session == null) {
     >                      synchronized (workers) {
     >                          if (workers.size() > getCorePoolSize()) {
     >                              // Remove now to prevent duplicate exit.
     >                              workers.remove(this);   <<<-------------
     >                              break;
     >                          }
     >              ...
     >
     >
     >              } finally {
     >                  synchronized (workers) {
     >                      workers.remove(this); <<------------------------
     >
     >
     > In the case where we reach the max pool size, we try to remove it
     > twice. OK, it's not really critical, because the first call will
     > succeed, and the second will do nothing.
     >
     > And yes, if there is an exception during runTasks, we will have an
     > increment of the idleWorkers despite the fact that the worker will
     > be removed.
     >
     >
     > Bottom line, there is a need to get this fixed.
     >
     > Can you open a JIRA for traceability?
     >
     >
     > On 11/01/2022 17:26, Guus der Kinderen wrote:
     >> Although I'm not sure if this is the source of my issue, I can
     >> reproduce the state by creating a unit test that causes a
     >> RuntimeException and/or Error to be thrown by a filter that is
     >> processing an event.
     >>
     >> In OrderedThreadPoolExecutor's Worker, this snippet causes the
     >> Worker to be removed, but the idleWorker count to be increased in
     >> such a scenario. Should the first try/finally block be removed
     >> perhaps (causing idleWorkers to be increased only if runTasks
     >> didn't throw anything)?
     >>
     >>          try {
     >>              if (session !=null) {
     >>                  runTasks(getSessionTasksQueue(session));
     >>              }
     >>          }finally {
     >>              idleWorkers.incrementAndGet();
     >>          }
     >>      }
     >> }finally {
     >>      synchronized (workers) {
     >>          workers.remove(this);
     >>          OrderedThreadPoolExecutor.this.completedTaskCount
     >> +=completedTaskCount.get();
     >>          workers.notifyAll();
     >>      }
     >> }
     >>
     >>
     >> On Tue, 11 Jan 2022 at 15:41, Guus der Kinderen
     >> <guus.der.kinde...@gmail.com> wrote:
     >>
     >>     Hello,
     >>
     >>     This was found with MINA 2.1.3. Is this a bug? I stumbled
     >>     upon this while performing analysis of heap dumps of JVMs
     >>     that suffered from an issue where no user was able to
     >>     establish a connection to our server software anymore.
     >>
     >>     Our application uses an ExecutorFilter. The
     >>     OrderedThreadPoolExecutor of the affected server seems to be
     >>     in an inconsistent state:
     >>
     >>       * idleWorkers: 2
     >>       * waitingSessions.count: 2293
     >>       * workers.map.size: 0
     >>
     >>     I will attempt to include a screenshot of the profiler that
     >>     holds more data (if the mailing list allows me).
     >>
     >>     What strikes me as odd is:
     >>
     >>       * No workers, while there are sessions waiting to be processed
     >>       * No workers, but a non-zero idle workers count
     >>
     >>     Servers that are unaffected by the issue have an idle worker
     >>     count that is equal to the number of workers (I assume that
     >>     the snapshots were taken when that server was not actively
     >>     processing data).
     >>
     >>     For reference: the issue that we are experiencing (possibly
     >>     caused by this) is tracked in
     >>     https://igniterealtime.atlassian.net/browse/OF-2367. Multiple
     >>     people are reporting the issue, but I've been unable to
     >>     reproduce it locally.
     >>
     >>     Kind regards,
     >>
     >>        Guus
     >>
     >>
     >

    --
    *Emmanuel Lécharny - CTO*
    205 Promenade des Anglais – 06200 NICE
    T. +33 (0)4 89 97 36 50
    P. +33 (0)6 08 33 32 61
    emmanuel.lecha...@busit.com <mailto:emmanuel.lecha...@busit.com>
    https://www.busit.com/ <https://www.busit.com/>

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: users-unsubscr...@mina.apache.org
    <mailto:users-unsubscr...@mina.apache.org>
    For additional commands, e-mail: users-h...@mina.apache.org
    <mailto:users-h...@mina.apache.org>

