Re: 7.3.1 creates thousands of threads after start up
On 6/13/2018 4:04 AM, Markus Jelsma wrote:
> You mentioned shard handler tweaks, thanks. I see we have an incorrect setting there for maximumPoolSize, way too high, but that doesn't account for the number of threads created. After reducing the number, for dubious reasons, twice the number of threads are created and the node dies.

The specific config in the shard handler I was thinking of was socket timeouts. The EofException in Jetty almost always indicates that the client disconnected before the server responded. A low socket timeout will cause a client to disconnect if the server takes longer than the timeout to finish its work. When the server finally does finish and tries to respond, it throws the EofException because the connection it was expecting to use is no longer there.

I believe that the default socket timeout in Solr's shard handler is 60 seconds, which is a relative eternity in most situations. It would take a particularly nasty GC pause problem to exceed that timeout.

> Regarding memory leaks, of course, the first that came to mind is that i made an error which only causes trouble on 7.3, but it is unreproducible so far, even if i fully replicate production in a test environment. Since it only leaks on commits, the first suspects were URPs, and the URPs are the only things i can disable in production without affecting customers. Needless to say, it wasn't the URPs.

Custom update processors could cause leaks, but it is not something that I would expect from a typical URP implementation. If you've disabled them and it's still happening, then that's probably not it. It's plugins for queries that have the most potential for not closing resources correctly, even when written by experienced programmers. I'm not sure what the potential for leaks is on index-time plugins, but I suspect that it's less likely than problems with query plugins.

Thanks,
Shawn
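The disconnect-then-fail sequence described above is not specific to Solr; it can be sketched with plain sockets. In this minimal illustration (nothing here is Solr or Jetty code, and the timeout values are invented), the client's read timeout is shorter than the server's processing time, so the client gives up and disconnects before a response arrives:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketTimeoutDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            // The "server" accepts the connection but is too busy to respond
            // within the client's timeout window.
            Thread slowServer = new Thread(() -> {
                try (Socket conn = server.accept()) {
                    Thread.sleep(5_000); // simulate slow request handling
                } catch (Exception ignored) {
                }
            });
            slowServer.setDaemon(true);
            slowServer.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(100);       // analogous to a low shard handler socketTimeout
                client.getInputStream().read(); // blocks waiting for a response
                System.out.println("got a response");
            } catch (SocketTimeoutException e) {
                // The client abandons the connection here; the server's eventual
                // write then has nowhere to go.
                System.out.println("client timed out and disconnected");
            }
        }
    }
}
```

When this sequence happens between Solr nodes, the server side of it is what surfaces in the logs as `org.eclipse.jetty.io.EofException`.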
RE: 7.3.1 creates thousands of threads after start up
Hello Shawn,

You mentioned shard handler tweaks, thanks. I see we have an incorrect setting there for maximumPoolSize, way too high, but that doesn't account for the number of threads created. After reducing the number, for dubious reasons, twice the number of threads are created and the node dies. For a short time, there were two identical collections (just for different tests) on the nodes; i have removed one of them, but the number of threads created doesn't change one bit. So it appears the shard handler config has nothing to do with it, or does it?

Regarding memory leaks, of course, the first that came to mind is that i made an error which only causes trouble on 7.3, but it is unreproducible so far, even if i fully replicate production in a test environment. Since it only leaks on commits, the first suspects were URPs, and the URPs are the only things i can disable in production without affecting customers. Needless to say, it wasn't the URPs.

But thanks anyway, whenever i have the courage again to test it, i'll enable INFO logging, which is disabled. Maybe it will reveal something. If anyone has even the weirdest unconventional suggestion on how to reproduce my production memory leak in a controlled test environment, let me know!

Thanks,
Markus

-Original message-
> From: Shawn Heisey
> Sent: Sunday 10th June 2018 22:42
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3.1 creates thousands of threads after start up
>
> On 6/8/2018 8:59 AM, Markus Jelsma wrote:
> > 2018-06-08 14:02:47.382 ERROR (qtp1458849419-1263) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://idx2:8983/solr/search/admin/ping
> > Caused by: org.eclipse.jetty.io.EofException
>
> If you haven't tweaked the shard handler config to drastically reduce the socket timeout, that is weird. The only thing that comes to mind is extreme GC pauses that cause the socket timeout to be exceeded.
>
> > We operate three distinct types of Solr collections; they only share the same Zookeeper quorum. The other two collections do not seem to have this problem, but i don't restart those as often as i restart this collection, as i am STILL trying to REPRODUCE the dreaded memory leak i reported having on 7.3 about two weeks ago. Sorry, but it drives me nuts!
>
> I've reviewed the list messages about the leak. As you might imagine, my immediate thought is that the custom plugins you're running are probably the cause, because we are not getting OOME reports like I would expect if there were a leak in Solr itself. It would not be unheard of for a custom plugin to experience no leaks with one Solr version but leak when Solr is upgraded, requiring a change in the plugin to properly close resources. I do not know if that's what's happening.
>
> A leak could lead to GC pause problems, but it does seem really odd for that to happen on a Solr node that's just been started. You could try bumping the heap size by 25 to 50 percent and see if the behavior changes at all. Honestly I don't expect it to change, and if it doesn't, then I do not know what the next troubleshooting step should be. I could review your solr.log, though I can't be sure I would see something you didn't.
>
> Thanks,
> Shawn
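For reference, the shard handler settings discussed in this message (maximumPoolSize, and the socket timeout mentioned earlier in the thread) live in the shardHandlerFactory section of solrconfig.xml. A sketch follows; the values shown are illustrative only, not recommendations, and defaults vary by Solr version:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- All values below are examples, not recommendations. -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="socketTimeout">60000</int>  <!-- ms; very low values cause premature disconnects -->
    <int name="connTimeout">5000</int>     <!-- ms; time allowed to establish a connection -->
    <int name="maximumPoolSize">100</int>  <!-- upper bound on threads for inter-shard requests -->
  </shardHandlerFactory>
</requestHandler>
```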
Re: 7.3.1 creates thousands of threads after start up
On 6/8/2018 8:59 AM, Markus Jelsma wrote:
> 2018-06-08 14:02:47.382 ERROR (qtp1458849419-1263) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://idx2:8983/solr/search/admin/ping
> Caused by: org.eclipse.jetty.io.EofException

If you haven't tweaked the shard handler config to drastically reduce the socket timeout, that is weird. The only thing that comes to mind is extreme GC pauses that cause the socket timeout to be exceeded.

> We operate three distinct types of Solr collections; they only share the same Zookeeper quorum. The other two collections do not seem to have this problem, but i don't restart those as often as i restart this collection, as i am STILL trying to REPRODUCE the dreaded memory leak i reported having on 7.3 about two weeks ago. Sorry, but it drives me nuts!

I've reviewed the list messages about the leak. As you might imagine, my immediate thought is that the custom plugins you're running are probably the cause, because we are not getting OOME reports like I would expect if there were a leak in Solr itself. It would not be unheard of for a custom plugin to experience no leaks with one Solr version but leak when Solr is upgraded, requiring a change in the plugin to properly close resources. I do not know if that's what's happening.

A leak could lead to GC pause problems, but it does seem really odd for that to happen on a Solr node that's just been started. You could try bumping the heap size by 25 to 50 percent and see if the behavior changes at all. Honestly I don't expect it to change, and if it doesn't, then I do not know what the next troubleshooting step should be. I could review your solr.log, though I can't be sure I would see something you didn't.

Thanks,
Shawn
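A hedged sketch of the plugin failure mode discussed above: a per-request resource that is opened but never closed leaks a little on every request until the heap fills. The class here is invented for illustration and is not a Solr API; the point is the try-with-resources pattern, which closes the resource on every path, including exceptions.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

public class LeakDemo {
    // Counts how many resources are currently open.
    static final AtomicInteger OPEN = new AtomicInteger();

    // Hypothetical per-request resource, NOT a Solr API.
    static class PerRequestResource implements Closeable {
        PerRequestResource() { OPEN.incrementAndGet(); }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    // Leaky version: the resource is opened and never closed.
    static void leakyRequest() {
        PerRequestResource r = new PerRequestResource();
        // ... use r, forget to close ...
    }

    // Safe version: try-with-resources closes it even if the body throws.
    static void safeRequest() {
        try (PerRequestResource r = new PerRequestResource()) {
            // ... use r ...
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) leakyRequest();
        System.out.println("open after leaky requests: " + OPEN.get());

        OPEN.set(0);
        for (int i = 0; i < 1000; i++) safeRequest();
        System.out.println("open after safe requests: " + OPEN.get());
    }
}
```

In a real plugin the "resource" might be a searcher reference, stream, or client handle; a version upgrade can change which of these must be explicitly released, which is consistent with a plugin that was leak-free on one Solr version leaking on the next.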
RE: 7.3.1 creates thousands of threads after start up
Hello Deepak,

Not in-between.

Thanks,
Markus

-Original message-
> From: Deepak Goel
> Sent: Friday 8th June 2018 17:14
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3.1 creates thousands of threads after start up
>
> Do these machines have a firewall in-between?
>
> On Fri, 8 Jun 2018, 20:29 Markus Jelsma, wrote:
> > Hello Shawn,
> >
> > The logs appear useless, they are littered with these:
> >
> > 2018-06-08 14:02:47.382 ERROR (qtp1458849419-1263) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://idx2:8983/solr/search/admin/ping
> > at org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:647)
> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:501)
> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
> > ..
> > at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
> > at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: org.eclipse.jetty.io.EofException
> > at org.eclipse.jetty.server.HttpConnection$SendCallback.reset(HttpConnection.java:704)
> > ..
> > at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:509)
> >
> > Regarding the versions, it is a bit hard to recall, but i do not think i have seen this on 7.2, and most certainly not on 7.1.
> >
> > We operate three distinct types of Solr collections; they only share the same Zookeeper quorum. The other two collections do not seem to have this problem, but i don't restart those as often as i restart this collection, as i am STILL trying to REPRODUCE the dreaded memory leak i reported having on 7.3 about two weeks ago. Sorry, but it drives me nuts!
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> > > From: Shawn Heisey
> > > Sent: Friday 8th June 2018 16:47
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.3.1 creates thousands of threads after start up
> > >
> > > On 6/8/2018 8:17 AM, Markus Jelsma wrote:
> > > > Our local test environment mini cluster goes nuts right after start up. It is a two node/shard/replica collection that starts up normally if only one node starts up. But as soon as the second node attempts to join the cluster, both nodes go crazy, creating thousands of threads with identical stack traces.
> > > >
> > > > "qtp1458849419-4738" - Thread t@4738
> > > >    java.lang.Thread.State: TIMED_WAITING
> > > > at sun.misc.Unsafe.park(Native Method)
> > > > - parking to wait for <6ee32168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > > > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> > > > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> > > > at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
> > > > at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:600)
> > > > at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:49)
> > > > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:663)
> > > > at java.lang.Thread.run(Thread.java:748)
> > > >
> > > >    Locked ownable synchronizers:
> > > > - None
> > > >
> > > > It does not always happen, but most of the time i am unable to boot the cluster normally. Sometimes, apparently right now for the first time, the GUI is still accessible.
> > > >
> > > > Is this a known issue?
> > >
> > > It's not a problem that I've heard of. There are no Solr classes in the stacktrace, only Jetty and Java classes. I won't try to tell you that a bug in Solr can't be the root cause, because it definitely can. The threads appear to be created by Jetty, but the supplied info doesn't indicate WHY it's happening.
> > >
> > > Presumably there's a previous version you've used where this problem did NOT happen. What version would that be?
> > >
> > > Can you share the solr.log file from both nodes when this happens? There might be a clue there.
> > >
> > > It sounds like you probably have a small number of collections in the dev cluster. Can you confirm that?
> > >
> > > Thanks,
> > > Shawn
Re: 7.3.1 creates thousands of threads after start up
Do these machines have a firewall in-between?

On Fri, 8 Jun 2018, 20:29 Markus Jelsma, wrote:
> Hello Shawn,
>
> The logs appear useless, they are littered with these:
>
> 2018-06-08 14:02:47.382 ERROR (qtp1458849419-1263) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://idx2:8983/solr/search/admin/ping
> at org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:647)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:501)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
> ..
> at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
> at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.eclipse.jetty.io.EofException
> at org.eclipse.jetty.server.HttpConnection$SendCallback.reset(HttpConnection.java:704)
> ..
> at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:509)
>
> Regarding the versions, it is a bit hard to recall, but i do not think i have seen this on 7.2, and most certainly not on 7.1.
>
> We operate three distinct types of Solr collections; they only share the same Zookeeper quorum. The other two collections do not seem to have this problem, but i don't restart those as often as i restart this collection, as i am STILL trying to REPRODUCE the dreaded memory leak i reported having on 7.3 about two weeks ago. Sorry, but it drives me nuts!
>
> Thanks,
> Markus
>
> -Original message-
> > From: Shawn Heisey
> > Sent: Friday 8th June 2018 16:47
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.3.1 creates thousands of threads after start up
> >
> > On 6/8/2018 8:17 AM, Markus Jelsma wrote:
> > > Our local test environment mini cluster goes nuts right after start up. It is a two node/shard/replica collection that starts up normally if only one node starts up. But as soon as the second node attempts to join the cluster, both nodes go crazy, creating thousands of threads with identical stack traces.
> > >
> > > "qtp1458849419-4738" - Thread t@4738
> > >    java.lang.Thread.State: TIMED_WAITING
> > > at sun.misc.Unsafe.park(Native Method)
> > > - parking to wait for <6ee32168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> > > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> > > at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
> > > at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:600)
> > > at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:49)
> > > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:663)
> > > at java.lang.Thread.run(Thread.java:748)
> > >
> > >    Locked ownable synchronizers:
> > > - None
> > >
> > > It does not always happen, but most of the time i am unable to boot the cluster normally. Sometimes, apparently right now for the first time, the GUI is still accessible.
> > >
> > > Is this a known issue?
> >
> > It's not a problem that I've heard of. There are no Solr classes in the stacktrace, only Jetty and Java classes. I won't try to tell you that a bug in Solr can't be the root cause, because it definitely can. The threads appear to be created by Jetty, but the supplied info doesn't indicate WHY it's happening.
> >
> > Presumably there's a previous version you've used where this problem did NOT happen. What version would that be?
> >
> > Can you share the solr.log file from both nodes when this happens? There might be a clue there.
> >
> > It sounds like you probably have a small number of collections in the dev cluster. Can you confirm that?
> >
> > Thanks,
> > Shawn
RE: 7.3.1 creates thousands of threads after start up
Hello Shawn,

The logs appear useless, they are littered with these:

2018-06-08 14:02:47.382 ERROR (qtp1458849419-1263) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://idx2:8983/solr/search/admin/ping
at org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:647)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:501)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
..
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.jetty.io.EofException
at org.eclipse.jetty.server.HttpConnection$SendCallback.reset(HttpConnection.java:704)
..
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:509)

Regarding the versions, it is a bit hard to recall, but i do not think i have seen this on 7.2, and most certainly not on 7.1.

We operate three distinct types of Solr collections; they only share the same Zookeeper quorum. The other two collections do not seem to have this problem, but i don't restart those as often as i restart this collection, as i am STILL trying to REPRODUCE the dreaded memory leak i reported having on 7.3 about two weeks ago. Sorry, but it drives me nuts!

Thanks,
Markus

-Original message-
> From: Shawn Heisey
> Sent: Friday 8th June 2018 16:47
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3.1 creates thousands of threads after start up
>
> On 6/8/2018 8:17 AM, Markus Jelsma wrote:
> > Our local test environment mini cluster goes nuts right after start up. It is a two node/shard/replica collection that starts up normally if only one node starts up. But as soon as the second node attempts to join the cluster, both nodes go crazy, creating thousands of threads with identical stack traces.
> >
> > "qtp1458849419-4738" - Thread t@4738
> >    java.lang.Thread.State: TIMED_WAITING
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for <6ee32168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> > at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:600)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:49)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:663)
> > at java.lang.Thread.run(Thread.java:748)
> >
> >    Locked ownable synchronizers:
> > - None
> >
> > It does not always happen, but most of the time i am unable to boot the cluster normally. Sometimes, apparently right now for the first time, the GUI is still accessible.
> >
> > Is this a known issue?
>
> It's not a problem that I've heard of. There are no Solr classes in the stacktrace, only Jetty and Java classes. I won't try to tell you that a bug in Solr can't be the root cause, because it definitely can. The threads appear to be created by Jetty, but the supplied info doesn't indicate WHY it's happening.
>
> Presumably there's a previous version you've used where this problem did NOT happen. What version would that be?
>
> Can you share the solr.log file from both nodes when this happens? There might be a clue there.
>
> It sounds like you probably have a small number of collections in the dev cluster. Can you confirm that?
>
> Thanks,
> Shawn
Re: 7.3.1 creates thousands of threads after start up
On 6/8/2018 8:17 AM, Markus Jelsma wrote:
> Our local test environment mini cluster goes nuts right after start up. It is a two node/shard/replica collection that starts up normally if only one node starts up. But as soon as the second node attempts to join the cluster, both nodes go crazy, creating thousands of threads with identical stack traces.
>
> "qtp1458849419-4738" - Thread t@4738
>    java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <6ee32168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:600)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:49)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:663)
> at java.lang.Thread.run(Thread.java:748)
>
>    Locked ownable synchronizers:
> - None
>
> It does not always happen, but most of the time i am unable to boot the cluster normally. Sometimes, apparently right now for the first time, the GUI is still accessible.
>
> Is this a known issue?

It's not a problem that I've heard of. There are no Solr classes in the stacktrace, only Jetty and Java classes. I won't try to tell you that a bug in Solr can't be the root cause, because it definitely can. The threads appear to be created by Jetty, but the supplied info doesn't indicate WHY it's happening.

Presumably there's a previous version you've used where this problem did NOT happen. What version would that be?

Can you share the solr.log file from both nodes when this happens? There might be a clue there.

It sounds like you probably have a small number of collections in the dev cluster. Can you confirm that?

Thanks,
Shawn
7.3.1 creates thousands of threads after start up
Hello,

Our local test environment mini cluster goes nuts right after start up. It is a two node/shard/replica collection that starts up normally if only one node starts up. But as soon as the second node attempts to join the cluster, both nodes go crazy, creating thousands of threads with identical stack traces.

"qtp1458849419-4738" - Thread t@4738
   java.lang.Thread.State: TIMED_WAITING
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <6ee32168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:600)
at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:49)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:663)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None

It does not always happen, but most of the time i am unable to boot the cluster normally. Sometimes, apparently right now for the first time, the GUI is still accessible.

Is this a known issue?

Thanks,
Markus
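The "thousands of threads with identical stack traces" symptom above can be quantified by grouping live threads by name prefix; the qtp... threads belong to Jetty's QueuedThreadPool, so a runaway count under one prefix points at that pool. A small generic sketch (not Solr code; it inspects its own JVM, but the same grouping can be applied by eye to a jstack dump of the affected Solr process):

```java
import java.util.Map;
import java.util.TreeMap;

public class ThreadCount {
    public static void main(String[] args) {
        Map<String, Integer> byPrefix = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Strip the numeric suffix: "qtp1458849419-4738" -> "qtp", "main" -> "main"
            String prefix = t.getName().replaceAll("[0-9-].*$", "");
            byPrefix.merge(prefix, 1, Integer::sum);
        }
        byPrefix.forEach((name, n) -> System.out.println(name + ": " + n));
    }
}
```

Run inside (or against a dump of) a node exhibiting the problem, a single prefix in the thousands confirms which pool is exploding, which is a useful detail to include alongside the solr.log files.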