On Fri, Oct 4, 2019 at 10:38 PM Emmanuel Lecharny <elecha...@apache.org>
wrote:

> Hi Rémy,
>
> On 2019/10/04 15:37:36, Rémy Maucherat <r...@apache.org> wrote:
> > On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny <elecha...@apache.org>
> > wrote:
> >
> > > Hi !
> > >
> > > I filed a ticket yesterday about a problem we face with many NIO
> > > frameworks, which I think could hit Tomcat too (see
> > > https://bz.apache.org/bugzilla/show_bug.cgi?id=63802). Actually, I think
> > > I'm facing this problem on a project I'm working on at the moment.
> > >
> > > Remy suggested we discuss it on this mailing list.
> > >
> > > Bottom line: under some circumstances that are not well defined, the
> > > call to select() might end up in an infinite loop eating all the CPU
> > > (select() returns 0, so select() is immediately called again, and we
> > > loop).
> > >
> > > In various NIO frameworks (and being a MINA committer, I have
> > > implemented the workaround discussed here), we control this situation
> > > by breaking the infinite loop this way:
> > > - if the select() call returns 0
> > > - then if we have called select() more than N times in less than M ms
> > > (N=10, M=100 in MINA)
> > > - then we create a new Selector, register all the SelectionKeys that
> > > were registered on the broken selector, and ditch the old selector.
> > >
> > > This workaround does not cost a lot when the selector works as
> > > designed, as a select() call should never return 0.
> > >
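The three steps listed above can be sketched roughly as follows. This is a
minimal illustration only, not MINA's or Tomcat's actual code: the SpinGuard
class and its record() and rebuild() methods are hypothetical names, and the
N/M thresholds are passed in by the caller. Note that select() with a
timeout, or after a wakeup(), can legitimately return 0, which is why the
sketch only trips when many empty returns land inside a short time window.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

/** Hypothetical helper illustrating the "rebuild the selector" workaround. */
final class SpinGuard {
    private final int maxEmptySelects;  // N: empty select() calls tolerated
    private final long windowMillis;    // M: window in which they are counted
    private long windowStart;
    private int emptySelects;

    SpinGuard(int maxEmptySelects, long windowMillis) {
        this.maxEmptySelects = maxEmptySelects;
        this.windowMillis = windowMillis;
    }

    /**
     * Records the result of one select() call. Returns true when select()
     * has returned 0 more than N times within M milliseconds.
     */
    boolean record(int selected, long nowMillis) {
        if (selected != 0) {
            emptySelects = 0;           // real work was done, reset the count
            return false;
        }
        if (emptySelects == 0 || nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis;    // start a new counting window
            emptySelects = 1;
            return false;
        }
        emptySelects++;
        if (emptySelects > maxEmptySelects) {
            emptySelects = 0;
            return true;                // selector looks broken
        }
        return false;
    }

    /** Moves every valid registration to a fresh Selector, closes the old one. */
    static Selector rebuild(Selector broken) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : broken.keys()) {
            if (!key.isValid()) {
                continue;               // already cancelled or channel closed
            }
            SelectableChannel channel = key.channel();
            int ops = key.interestOps();
            Object attachment = key.attachment();
            key.cancel();
            channel.register(fresh, ops, attachment);
        }
        broken.close();
        return fresh;
    }
}
```

In a poller loop this would be used along the lines of
"if (guard.record(selector.select(timeout), System.currentTimeMillis()))
selector = SpinGuard.rebuild(selector);", with the loop thread being the
only one touching the selector's key set.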
> >
> > There's actually a very similar hack for APR that I put in place a long
> > time ago [
> > https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> > ]. I don't even know if it's actually useful and it's certainly not
> > testable. Overall what it does is pretty terrible :(
> >
> > Personally I would like to know more about this "long lived bug either in
> > the JDK or even in Linux epoll implementation", such as actual platform
> > details and JVM versions used, since I've never heard about it in the
> > first place.
>
> For the record, I had a discussion yesterday with a close friend who was
> my co-worker back in the 90's. He clearly remembers, while working on the
> SUN TCP stack, that such a problem occurred back then. Yes, 25 years
> ago... OK, that was just for fun, it's likely to be perfectly unrelated ;-)
>
> At MINA, we were hit by this bug in 2009 (see
> https://issues.apache.org/jira/browse/DIRMINA-678), and it was linked to
> a bug reported on Jetty (
> http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html),
> itself related to some JDK bugs, supposedly fixed since then.
>
> I had a long conversation with Jean-François Arcand somewhere around that
> date, and he suggested we adopt the same workaround he applied to Grizzly.
> We also had a conversation with Alan Bateman during a JavaOne in SF, but
> nothing specific resulted from it, except that AFAICR, he acknowledged
> there is an issue.
>
> So this problem started with JDK 6, but I can't guarantee it wasn't
> already present in JDK 5 or 4. It occurred on Linux, and not on any other
> OS like Windows or Mac OS X. It's not exactly fresh in my mind, because it
> was already 10 years ago.
>

NIO support was added in Tomcat 6.0, supporting Java 5+, and it wasn't very
good then. It's only with Java 6 that NIO started getting epoll support, and
I'm pretty sure the original issue did not actually survive. Despite the
popularity of the NIO connector, this was not reported for Tomcat. If we had
gotten the report at the same time as the others it would be more logical,
so something is different here.
https://github.com/netty/netty/issues/327 has details but I'm still not
very convinced. You should give details on your platform and everything
else, since it's obvious at this point that this is far less common with
Tomcat.


>
> > Also I'd like to know since NIO2 doesn't expose its poller and almost
> > certainly doesn't have such a platform specific mysterious thing inside
> > it [we can check I guess].
>
> No idea, but I think NIO.2 has just added some coating over what was NIO.1
> (gut feeling here...).
>

It's a new codebase with very clean code.


>
> > In the context of NIO, do you have evidence the
> > hack has been tested to work (besides avoiding the CPU loop) and allowed
> > the server to continue its regular operation without any impact ?
>
> Absolutely. We do log in MINA when a new selector is created, and we had
> an issue related to a case where this piece of code was called, which has
> since been fixed:
> https://issues.apache.org/jira/browse/DIRMINA-762?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
>
> So we definitely know that people get hit by the initial issue (select()
> returns 0), a new selector is created, and everything is fine from
> the user's perspective (I do believe that creating the new selector and
> registering all the SelectionKeys on it is not worse than having to restart
> the server manually...)
>
> In any case, Grizzly has probably the best possible approach to this
> problem: make the workaround optional.
>
> For Tomcat, I'm tempted to use the Http11AprProtocol class instead of the
> NIO one, as one can swap the protocol in the configuration, but the
> downside is that you need OpenSSL already installed on your machine. That
> would be an acceptable workaround in my case, but a painful one. A similar
> approach would be pleasant to have: a Http11NIONoSpinProtocol class that
> we can use if needed.
>

You should try the NIO2 connector first. I find this whole thing super odd
and want to investigate instead of simply hiding the problem away.

Rémy


>
> WDYT ?
>
> Emmanuel
>
