Re: Tomcat 8 epoll spinning issue (100% CPU)

2019-10-08 Thread Emmanuel Lecharny



On 2019/10/07 10:18:43, Rémy Maucherat  wrote: 
> On Mon, Oct 7, 2019 at 11:15 AM Emmanuel Lecharny 
> wrote:
> 
> >
> >
> > On 2019/10/05 11:12:46, Rémy Maucherat  wrote:
> > > On Fri, Oct 4, 2019 at 10:38 PM Emmanuel Lecharny 
> > > wrote:
> > >
> > > > Hi remy,
> > > >
> > > > On 2019/10/04 15:37:36, Rémy Maucherat  wrote:
> > > > > On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny <
> > elecha...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi !
> > > > > >
> > > > > > I filled a ticket yesterday about a pb we face with many NIO
> > framework,
> > > > > > which I think could hit Tomcat too (see
> > > > > > https://bz.apache.org/bugzilla/show_bug.cgi?id=63802). Actually, I
> > > > think
> > > > > > I'm facing this problem on a project I'm working on atm.
> > > > > >
> > > > > > Remy suggested we discuss it on this mailing list.
> > > > > >
> > > > > > Bottom line, what happens is that under some circumstances not well
> > > > > > defined, the call to select() might end to an infinite loop eating
> > all
> > > > the
> > > > > > CPU (select() returns 0, so select is immediately called again,
> > and we
> > > > > > loop).
> > > > > >
> > > > > > In various NIO framworks - and being a MINA committer, I have
> > > > implemented
> > > > > > the discussed workaround -, we are controlling this situation by
> > > > breaking
> > > > > > this infinite loop this way :
> > > > > > - if the select() call returns 0
> > > > > > - then if we have called select() more than N times in less than M
> > ms
> > > > > > (N=10, M=100 in MINA)
> > > > > > - then we create a new Selector, register all the selectionKey that
> > > > were
> > > > > > registered on the broken selector, and ditch the old selector.
> > > > > >
> > > > > > This workaround does not cost a lot when the selector works as
> > > > designed,
> > > > > > as a select() call should never return 0.
> > > > > >
> > > > >
> > > > > There's actually a very similar hack for APR that has been placed by
> > > > myself
> > > > > a long time ago [
> > > > >
> > > >
> > https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> > > > > ], I don't even know if it's actually useful and it's certainly not
> > > > > testable. Overall what it does is pretty terrible :(
> > > > >
> > > > > Personally I would like to know more about this "long lived bug
> > either in
> > > > > the JDK or even in Linux epoll implementation" like actual platform
> > > > details
> > > > > and JVM versions used since I've never heard about it in the first
> > place.
> > > >
> > > > for the record, I had a discussion yesterday with one of my close
> > friend
> > > > and co-worker back in the 90's. He remember clearly, while working on
> > the
> > > > SUN TCP stack,  that such a problem occorded back then. Yes, 25 years
> > > > ago... Ok, that was just for the fun, it's likely be perfectly
> > unrelated ;-)
> > > >
> > > > At MINA, we were hit by this bug in 2009 (see
> > > > https://issues.apache.org/jira/browse/DIRMINA-678), and it was linked
> > to
> > > > a bug reported on Jetty (
> > > >
> > http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html
> > ),
> > > > itself related to some JDK bugs, supposedly fixed since then.
> > > >
> > > > I had a long conversation with Jean-François Arcand somewhere around
> > this
> > > > date, and he suggested we adopt the same workaround he applied to
> > Grizzly.
> > > > We also had a convo with Alan Bateman during a Java One in SF, but
> > nothing
> > > > specific resulted from this convo, except that AFAICR, he aknowledge
> > there
> > > > is an issue.
> > > >
> > > > So this problem started with JDK 6, but I can'

Re: Tomcat 8 epoll spinning issue (100% CPU)

2019-10-07 Thread Emmanuel Lecharny



On 2019/10/05 11:12:46, Rémy Maucherat  wrote: 
> On Fri, Oct 4, 2019 at 10:38 PM Emmanuel Lecharny 
> wrote:
> 
> > Hi remy,
> >
> > On 2019/10/04 15:37:36, Rémy Maucherat  wrote:
> > > On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny 
> > > wrote:
> > >
> > > > Hi !
> > > >
> > > > I filled a ticket yesterday about a pb we face with many NIO framework,
> > > > which I think could hit Tomcat too (see
> > > > https://bz.apache.org/bugzilla/show_bug.cgi?id=63802). Actually, I
> > think
> > > > I'm facing this problem on a project I'm working on atm.
> > > >
> > > > Remy suggested we discuss it on this mailing list.
> > > >
> > > > Bottom line, what happens is that under some circumstances not well
> > > > defined, the call to select() might end to an infinite loop eating all
> > the
> > > > CPU (select() returns 0, so select is immediately called again, and we
> > > > loop).
> > > >
> > > > In various NIO framworks - and being a MINA committer, I have
> > implemented
> > > > the discussed workaround -, we are controlling this situation by
> > breaking
> > > > this infinite loop this way :
> > > > - if the select() call returns 0
> > > > - then if we have called select() more than N times in less than M ms
> > > > (N=10, M=100 in MINA)
> > > > - then we create a new Selector, register all the selectionKey that
> > were
> > > > registered on the broken selector, and ditch the old selector.
> > > >
> > > > This workaround does not cost a lot when the selector works as
> > designed,
> > > > as a select() call should never return 0.
> > > >
> > >
> > > There's actually a very similar hack for APR that has been placed by
> > myself
> > > a long time ago [
> > >
> > https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> > > ], I don't even know if it's actually useful and it's certainly not
> > > testable. Overall what it does is pretty terrible :(
> > >
> > > Personally I would like to know more about this "long lived bug either in
> > > the JDK or even in Linux epoll implementation" like actual platform
> > details
> > > and JVM versions used since I've never heard about it in the first place.
> >
> > for the record, I had a discussion yesterday with one of my close friend
> > and co-worker back in the 90's. He remember clearly, while working on the
> > SUN TCP stack,  that such a problem occorded back then. Yes, 25 years
> > ago... Ok, that was just for the fun, it's likely be perfectly unrelated ;-)
> >
> > At MINA, we were hit by this bug in 2009 (see
> > https://issues.apache.org/jira/browse/DIRMINA-678), and it was linked to
> > a bug reported on Jetty (
> > http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html),
> > itself related to some JDK bugs, supposedly fixed since then.
> >
> > I had a long conversation with Jean-François Arcand somewhere around this
> > date, and he suggested we adopt the same workaround he applied to Grizzly.
> > We also had a convo with Alan Bateman during a Java One in SF, but nothing
> > specific resulted from this convo, except that AFAICR, he aknowledge there
> > is an issue.
> >
> > So this problem started with JDK 6, but I can't guarantee it wasn't
> > already present in JDK 5 or 4, on linux, and not on any other OS like
> > windows or Mac OSX. It's not exactly fresh in my mind, because it was
> > already 10 years ago.
> >
> 
> NIO support was added in Tomcat 6.0, supporting Java 5+, it wasn't very
> good then. It's only with Java 6 that NIO started getting epoll support and
> I'm pretty sure the original issue did not actually survive. Despite the
> popularity of the NIO connector this was not reported for Tomcat, if we got
> the report at the same time as the others it would be more logical so
> something is different here.
> https://github.com/netty/netty/issues/327 has details but I'm still not
> very convinced. You should give details on your platform and everything
> else since it's obvious at this point this is far less common with Tomcat.

There is not much I can tell about this issue, besides what I already said. I 
can only stress that for a few MINA users this was a real burden, and the 
same goes for Netty, Grizzly and Jetty. I would be *very* surprised if those 
four different projects, all based on NIO, were facing such an issue while 
Tomcat was immune to it.

> You should try the NIO2 connector first. 

I'll do that right away. If it fixes the 100% CPU usage I see from time to 
time, then I would consider the issue resolved (there is no point in working 
around something in the NIO code if NIO2 solves it...)

Thanks !


-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



Re: Tomcat 8 epoll spinning issue (100% CPU)

2019-10-04 Thread Emmanuel Lecharny



On 2019/10/04 22:47:17, Christopher Schultz  
wrote: 
> 
> Emmanuel,
> 
> On 10/4/19 16:38, Emmanuel Lecharny wrote:
> > Hi remy,
> >
> > On 2019/10/04 15:37:36, Rémy Maucherat  wrote:
> >> On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny
> >>  wrote:
> >>
> >>> Hi !
> >>>
> >>> I filled a ticket yesterday about a pb we face with many NIO
> >>> framework, which I think could hit Tomcat too (see
> >>> https://bz.apache.org/bugzilla/show_bug.cgi?id=63802).
> >>> Actually, I think I'm facing this problem on a project I'm
> >>> working on atm.
> >>>
> >>> Remy suggested we discuss it on this mailing list.
> >>>
> >>> Bottom line, what happens is that under some circumstances not
> >>> well defined, the call to select() might end to an infinite
> >>> loop eating all the CPU (select() returns 0, so select is
> >>> immediately called again, and we loop).
> >>>
> >>> In various NIO framworks - and being a MINA committer, I have
> >>> implemented the discussed workaround -, we are controlling this
> >>> situation by breaking this infinite loop this way : - if the
> >>> select() call returns 0 - then if we have called select() more
> >>> than N times in less than M ms (N=10, M=100 in MINA) - then we
> >>> create a new Selector, register all the selectionKey that were
> >>> registered on the broken selector, and ditch the old selector.
> >>>
> >>> This workaround does not cost a lot when the selector works as
> >>> designed, as a select() call should never return 0.
> >>>
> >>
> >> There's actually a very similar hack for APR that has been placed
> >> by myself a long time ago [
> >> https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> >> ], I don't even know if it's actually useful and it's certainly not
> >> testable. Overall what it does is pretty terrible :(
> >>
> >> Personally I would like to know more about this "long lived bug
> >> either in the JDK or even in Linux epoll implementation" like
> >> actual platform details and JVM versions used since I've never
> >> heard about it in the first place.
> >
> > for the record, I had a discussion yesterday with one of my close
> > friend and co-worker back in the 90's. He remember clearly, while
> > working on the SUN TCP stack,  that such a problem occorded back
> > then. Yes, 25 years ago... Ok, that was just for the fun, it's
> > likely be perfectly unrelated ;-)
> >
> > At MINA, we were hit by this bug in 2009 (see
> > https://issues.apache.org/jira/browse/DIRMINA-678), and it was
> > linked to a bug reported on Jetty
> > (http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html),
> > itself related to some JDK bugs, supposedly fixed since then.
> >
> > I had a long conversation with Jean-François Arcand somewhere
> > around this date, and he suggested we adopt the same workaround he
> > applied to Grizzly. We also had a convo with Alan Bateman during a
> > Java One in SF, but nothing specific resulted from this convo,
> > except that AFAICR, he aknowledge there is an issue.
> >
> > So this problem started with JDK 6, but I can't guarantee it wasn't
> > already present in JDK 5 or 4, on linux, and not on any other OS
> > like windows or Mac OSX. It's not exactly fresh in my mind, because
> > it was already 10 years ago.
> >
> >> Also I'd like to know since NIO2 doesn't expose its poller and
> >> almost certainly doesn't have such a platform specific mysterious
> >> thing inside it [we can check I guess].
> >
> > No idea, but I think NIO.2 has just added some coating over what
> > was NIO.1 (guts feeling here...).
> >
> > In the context of NIO, do you have evidence the
> >> hack has been tested to work (besides avoiding the CPU loop) and
> >> allowed the server to continue its regular operation without any
> >> impact ?
> >
> > Absolutely. We do log in MINA when a new selector is created, and
> > we have had some issue related to a case where this piece of code
> > was called, fixed since :
> > https://issues.apache.org/jira/browse/DIRMINA-762?page=com.atlassian.j
> 

Re: Tomcat 8 epoll spinning issue (100% CPU)

2019-10-04 Thread Emmanuel Lecharny
Hi Rémy,

On 2019/10/04 15:37:36, Rémy Maucherat  wrote: 
> On Fri, Oct 4, 2019 at 3:40 PM Emmanuel Lecharny 
> wrote:
> 
> > Hi !
> >
> > I filled a ticket yesterday about a pb we face with many NIO framework,
> > which I think could hit Tomcat too (see
> > https://bz.apache.org/bugzilla/show_bug.cgi?id=63802). Actually, I think
> > I'm facing this problem on a project I'm working on atm.
> >
> > Remy suggested we discuss it on this mailing list.
> >
> > Bottom line, what happens is that under some circumstances not well
> > defined, the call to select() might end to an infinite loop eating all the
> > CPU (select() returns 0, so select is immediately called again, and we
> > loop).
> >
> > In various NIO framworks - and being a MINA committer, I have implemented
> > the discussed workaround -, we are controlling this situation by breaking
> > this infinite loop this way :
> > - if the select() call returns 0
> > - then if we have called select() more than N times in less than M ms
> > (N=10, M=100 in MINA)
> > - then we create a new Selector, register all the selectionKey that were
> > registered on the broken selector, and ditch the old selector.
> >
> > This workaround does not cost a lot when the selector works as designed,
> > as a select() call should never return 0.
> >
> 
> There's actually a very similar hack for APR that has been placed by myself
> a long time ago [
> https://github.com/apache/tomcat/blob/master/java/org/apache/tomcat/util/net/AprEndpoint.java#L1410
> ], I don't even know if it's actually useful and it's certainly not
> testable. Overall what it does is pretty terrible :(
> 
> Personally I would like to know more about this "long lived bug either in
> the JDK or even in Linux epoll implementation" like actual platform details
> and JVM versions used since I've never heard about it in the first place.

For the record, I had a discussion yesterday with a close friend of mine, who 
was a co-worker back in the 90's. He clearly remembers that, while he was 
working on the Sun TCP stack, such a problem occurred back then. Yes, 25 years 
ago... OK, that was just for fun; it's likely to be completely unrelated ;-)

At MINA, we were hit by this bug in 2009 (see 
https://issues.apache.org/jira/browse/DIRMINA-678), and it was linked to a bug 
reported on Jetty 
(http://jetty.4.x6.nabble.com/jira-Created-JETTY-937-SelectChannelConnector-100-CPU-usage-on-Linux-td36385.html),
 itself related to some JDK bugs, supposedly fixed since then.

I had a long conversation with Jean-François Arcand around that time, and he 
suggested we adopt the same workaround he had applied to Grizzly. We also had 
a convo with Alan Bateman during a JavaOne in SF, but nothing specific came 
out of it, except that, AFAICR, he acknowledged there is an issue.

So this problem showed up with JDK 6 on Linux (and not on any other OS like 
Windows or Mac OS X), but I can't guarantee it wasn't already present in JDK 5 
or 4. It's not exactly fresh in my mind, because it was already 10 years ago.

> Also I'd like to know since NIO2 doesn't expose its poller and almost
> certainly doesn't have such a platform specific mysterious thing inside it
> [we can check I guess]. 

No idea, but I think NIO.2 has just added some coating over what was NIO.1 
(gut feeling here...).

> In the context of NIO, do you have evidence the
> hack has been tested to work (besides avoiding the CPU loop) and allowed
> the server to continue its regular operation without any impact ?

Absolutely. MINA logs whenever a new selector is created, and we have had an 
issue related to a case where this piece of code was called, which has since 
been fixed: 
https://issues.apache.org/jira/browse/DIRMINA-762?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

So we definitely know that people get hit by the initial issue (select() 
returns 0), that a new selector then gets created, and that everything is fine 
from the user's perspective (I do believe that creating the new selector and 
re-registering all the channels on it is no worse than having to restart the 
server manually...)

In any case, Grizzly has probably the best possible approach to this problem: 
make the workaround optional. 
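
As a rough illustration of what "optional" could look like (this is not how 
Grizzly or Tomcat actually expose it), the workaround could be guarded by a 
JVM system property. The class name and the property name 
example.nio.selectorSpinWorkaround below are hypothetical, picked for this 
sketch only:

```java
// Hypothetical toggle for a selector-rebuild workaround. Neither the class
// nor the property name comes from Grizzly, MINA or Tomcat.
final class SpinWorkaroundToggle {
    // Assumed property name, chosen for this example only.
    static final String PROPERTY = "example.nio.selectorSpinWorkaround";

    /** Returns true only when the system property is set to "true". */
    static boolean isEnabled() {
        // Boolean.getBoolean reads a system property, not an environment variable.
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println("Selector spin workaround enabled: " + isEnabled());
    }
}
```

With something like this, the default behaviour stays untouched, and a user who 
sees the 100% CPU spin would start the JVM with 
-Dexample.nio.selectorSpinWorkaround=true to opt in.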

For Tomcat, I'm tempted to use the Http11AprProtocol class instead of the NIO 
one, as one can swap the protocol in the configuration, but the drawback is 
that you need OpenSSL already installed on your machine. That would be an 
acceptable workaround in my case, but a painful one. A similar approach would 
be nice to have: an Http11NIONoSpinProtocol class that we could use if needed.

WDYT ?

Emmanuel




Tomcat 8 epoll spinning issue (100% CPU)

2019-10-04 Thread Emmanuel Lecharny
Hi !

I filed a ticket yesterday about a problem we face with many NIO frameworks, 
which I think could hit Tomcat too (see 
https://bz.apache.org/bugzilla/show_bug.cgi?id=63802). Actually, I think I'm 
facing this problem on a project I'm working on atm.

Remy suggested we discuss it on this mailing list.

Bottom line, under some circumstances that are not well defined, the call to 
select() might end up in an infinite loop eating all the CPU (select() returns 
0, so select() is immediately called again, and we loop).

In various NIO frameworks (being a MINA committer, I implemented the 
workaround discussed here myself), we handle this situation by breaking the 
infinite loop this way:
- if the select() call returns 0
- then if we have called select() more than N times in less than M ms (N=10, 
M=100 in MINA)
- then we create a new Selector, re-register on it all the channels whose 
SelectionKeys were registered on the broken selector, and ditch the old 
selector.

This workaround does not cost a lot when the selector works as designed, as a 
select() call should never return 0. 
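
The three steps above can be sketched in Java roughly as follows. This is a 
minimal illustration, not MINA's or Tomcat's actual code: the class name 
SelectorSpinGuard and its methods are hypothetical, and timestamps are passed 
in by the caller so the counting logic can be exercised without a real spin. 
It also assumes that the caller does not feed it the zero returns it caused 
itself, since a select() that was woken up through Selector.wakeup(), or that 
timed out, legitimately returns 0.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper: counts empty select() returns and rebuilds the selector.
final class SelectorSpinGuard {
    private final int maxEmptySelects;  // N in the description (10 in MINA)
    private final long windowNanos;     // M in the description (100 ms in MINA)
    private long windowStartNanos;
    private int emptySelects;

    SelectorSpinGuard(int maxEmptySelects, long windowMillis) {
        this.maxEmptySelects = maxEmptySelects;
        this.windowNanos = windowMillis * 1_000_000L;
    }

    /**
     * Records the result of one select() call.
     * Returns true once more than N calls returned 0 within the M ms window.
     */
    boolean recordSelect(int selectedKeys, long nowNanos) {
        if (selectedKeys != 0) {
            emptySelects = 0;                 // healthy select: reset the counter
            return false;
        }
        if (emptySelects == 0 || nowNanos - windowStartNanos > windowNanos) {
            windowStartNanos = nowNanos;      // first empty select of a new window
            emptySelects = 1;
            return false;
        }
        if (++emptySelects > maxEmptySelects) {
            emptySelects = 0;                 // count afresh after the rebuild
            return true;
        }
        return false;
    }

    /**
     * Moves every valid registration to a fresh Selector and closes the old one.
     * Meant to be called from the thread that owns the selector.
     */
    static Selector rebuild(Selector broken) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : broken.keys()) {
            if (!key.isValid()) {
                continue;                     // channel closed or key cancelled
            }
            SelectableChannel channel = key.channel();
            int interestOps = key.interestOps();
            Object attachment = key.attachment();
            key.cancel();                     // drop the old registration
            channel.register(fresh, interestOps, attachment);
        }
        broken.close();
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        SelectorSpinGuard guard = new SelectorSpinGuard(10, 100);
        Selector selector = Selector.open();
        // Simulated spin: 11 empty selects, 1 ms apart (no real select() call).
        for (int i = 0; i < 11; i++) {
            if (guard.recordSelect(0, i * 1_000_000L)) {
                System.out.println("Spin detected, rebuilding selector");
                selector = rebuild(selector);
            }
        }
        selector.close();
    }
}
```

In a real poller loop, the call would look like 
guard.recordSelect(selector.select(), System.nanoTime()), followed by 
selector = SelectorSpinGuard.rebuild(selector) when it returns true. Logging 
each rebuild makes it possible to tell afterwards whether the workaround was 
ever triggered.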

I suggest Tomcat add such a workaround in the various versions (8 and 9 at 
least).

Emmanuel Lécharny
