Lol at breaking during a demo - always the way it is! :) I agree, we are
just tip-toeing around the issue, but waiting for 4.5 is definitely an
option if we "get by" for now in testing; patched Solr versions seem to
make people uneasy sometimes :).

Seeing as there seems to be some danger with SOLR-5216 (in some ways it blows up
worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and
SOLR-4816 are making it into 4.5? I feel those two in combination will make a
world of difference!

Thanks so much again guys!

Tim



On 12 September 2013 03:43, Erick Erickson <erickerick...@gmail.com> wrote:

> Fewer client threads updating makes sense, and going to 1 core also seems
> like it might help. But it's all a crap-shoot unless the underlying cause
> gets fixed up. Both would improve things, but you'll still hit the problem
> sometime, probably when doing a demo for your boss ;).
>
> Adrien has branched the code for SOLR 4.5 in preparation for a release
> candidate tentatively scheduled for next week. You might just start working
> with that branch if you can rather than apply individual patches...
>
> I suspect there'll be a couple more changes to this code (looks like
> Shikhar already raised an issue for instance) before 4.5 is finally cut...
>
> FWIW,
> Erick
>
>
>
> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
>
> > Thanks Erick!
> >
> > Yeah, I think the next step will be CloudSolrServer with the SOLR-4816
> > patch. I think that is a very, very useful patch by the way. SOLR-5232
> > seems promising as well.
> >
> > I see your point on the more-shards idea, this is obviously a
> > global/instance-level lock. If I really had to, I suppose I could run more
> > Solr instances to reduce locking then? Currently I have 2 cores per
> > instance and I could go 1-to-1 to simplify things.
> >
> > The good news is we seem to be more stable since changing to a bigger
> > client->solr batch-size and fewer client threads updating.
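> >
> > For reference, the update-pushing app now does roughly the following with
> > SolrJ's CloudSolrServer (a simplified sketch only - the zkHost string,
> > collection name and the pendingDocs iterable are placeholders for our real
> > config and Redis-backed queue, and error handling/retries are left out):
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >     import org.apache.solr.client.solrj.impl.CloudSolrServer;
> >     import org.apache.solr.common.SolrInputDocument;
> >
> >     public class BatchedUpdater {
> >         public static void push(Iterable<SolrInputDocument> pendingDocs) throws Exception {
> >             // Talks to the cluster via ZooKeeper instead of an HTTP VIP;
> >             // commits are left to Solr's autoCommit settings.
> >             CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181"); // placeholder zkHost
> >             solr.setDefaultCollection("collection1");   // placeholder collection
> >
> >             List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(200);
> >             for (SolrInputDocument doc : pendingDocs) {  // stands in for our Redis queue
> >                 batch.add(doc);
> >                 if (batch.size() >= 200) {               // the batch size we settled on
> >                     solr.add(batch);                     // one request per 200 docs
> >                     batch.clear();
> >                 }
> >             }
> >             if (!batch.isEmpty()) {
> >                 solr.add(batch);
> >             }
> >             solr.shutdown();
> >         }
> >     }
> >
> > Combined with SOLR-4816's document routing, each batch should mostly go
> > straight to the right leader rather than being re-forwarded.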
> >
> > Cheers,
> >
> > Tim
> >
> > On 11/09/13 04:19 AM, Erick Erickson wrote:
> >
> >> If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
> >> copy of the 4x branch. By "recent", I mean like today; it looks like Mark
> >> applied this early this morning. But several reports indicate that this
> >> will solve your problem.
> >>
> >> I would expect that increasing the number of shards would make the
> >> problem worse, not better.
> >>
> >> There's also SOLR-5232...
> >>
> >> Best
> >> Erick
> >>
> >>
> >> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>
> >>> Hey guys,
> >>>
> >>> Based on my understanding of the problem we are encountering, I feel we've
> >>> been able to reduce the likelihood of this issue by making the following
> >>> changes to our app's usage of SolrCloud:
> >>>
> >>> 1) We increased our document batch size to 200 from 10 - our app batches
> >>> updates to reduce HTTP requests/overhead. The theory is increasing the
> >>> batch size reduces the likelihood of this issue happening.
> >>> 2) We reduced to 1 application node sending updates to SolrCloud - we write
> >>> Solr updates to Redis, and have previously had 4 application nodes pushing
> >>> the updates to Solr (popping off the Redis queue). Reducing the number of
> >>> nodes pushing to Solr reduces the concurrency on SolrCloud.
> >>> 3) Fewer threads pushing to SolrCloud - due to the increase in batch size,
> >>> we were able to go down to 5 update threads on the update-pushing-app (from
> >>> 10 threads).
> >>>
> >>> To be clear, the above only reduces the likelihood of the issue happening,
> >>> and DOES NOT actually resolve the issue at hand.
> >>>
> >>> If we happen to encounter issues with the above 3 changes, the next steps
> >>> (I could use some advice on) are:
> >>>
> >>> 1) Increase the number of shards (2x) - the theory here is this reduces the
> >>> locking on shards because there are more shards. Am I onto something here,
> >>> or will this not help at all?
> >>> 2) Use CloudSolrServer - currently we have a plain-old least-connection
> >>> HTTP VIP. If we go "direct" to what we need to update, this will reduce
> >>> concurrency in SolrCloud a bit. Thoughts?
> >>>
> >>> Thanks all!
> >>>
> >>> Cheers,
> >>>
> >>> Tim
> >>>
> >>>
> >>> On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>
> >>>  Enjoy your trip, Mark! Thanks again for the help!
> >>>>
> >>>> Tim
> >>>>
> >>>>
> >>>> On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>
> >>>>> Okay, thanks, useful info. Getting on a plane, but I'll look more at this
> >>>>> soon. That 10k thread spike is good to know - that's no good and could
> >>>>> easily be part of the problem. We want to keep that from happening.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>
> >>>>>> Hey Mark,
> >>>>>>
> >>>>>> The farthest we've made it at the same batch size/volume was 12 hours
> >>>>>> without this patch, but that isn't consistent. Sometimes we would only
> >>>>>> get to 6 hours or less.
> >>>>>>
> >>>>>> During the crash I can see an amazing spike in threads to 10k, which is
> >>>>>> essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
> >>>>>> cannot open native thread errors" that always follow this. Weird!
> >>>>>>
> >>>>>> We also notice a spike in CPU around the crash. The instability caused some
> >>>>>> shard recovery/replication though, so that CPU may be a symptom of the
> >>>>>> replication, or is possibly the root cause. The CPU spikes from about
> >>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
> >>>>>> spiking, isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole
> >>>>>> index is in 128GB RAM, 6xRAID10 15k).
> >>>>>>
> >>>>>> More on resources: our disk I/O seemed to spike about 2x during the crash
> >>>>>> (about 1300kbps written to 3500kbps), but this may have been the
> >>>>>> replication, or ERROR logging (we generally log nothing due to
> >>>>>> WARN-severity unless something breaks).
> >>>>>>
> >>>>>> Lastly, I found this stack trace occurring frequently, and have no idea
> >>>>>> what it is (may be useful or not):
> >>>>>>
> >>>>>> "java.lang.IllegalStateException :
> >>>>>>       at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >>>>>>       at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>       at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >>>>>>       at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >>>>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>       at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>       at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>       at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >>>>>>       at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >>>>>>       at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>       at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >>>>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>       at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >>>>>>       at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>       at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>       at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>       at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >>>>>>       at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >>>>>>       at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>       at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >>>>>>       at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >>>>>>       at java.lang.Thread.run(Thread.java:724)"
> >>>>>>
> >>>>>> On your live_nodes question, I don't have historical data on this from
> >>>>>> when the crash occurred, which I guess is what you're looking for. I could
> >>>>>> add this to our monitoring for future tests, however. I'd be glad to
> >>>>>> continue further testing, but I think first more monitoring is needed to
> >>>>>> understand this further. Could we come up with a list of metrics that would
> >>>>>> be useful to see following another test and successful crash?
> >>>>>>
> >>>>>> Metrics needed:
> >>>>>>
> >>>>>> 1) # of live_nodes.
> >>>>>> 2) Full stack traces.
> >>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >>>>>> 4) Solr's JVM thread count (already done)
> >>>>>> 5) ?
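> >>>>>>
> >>>>>> For #3 and #4 I was thinking of polling each Solr JVM over JMX, roughly
> >>>>>> like the sketch below (assumes JMX remote is enabled on the Solr JVMs; the
> >>>>>> host/port is a placeholder, and getProcessCpuLoad() relies on the
> >>>>>> com.sun.management extension available in Java 7):
> >>>>>>
> >>>>>>     import java.lang.management.ManagementFactory;
> >>>>>>     import java.lang.management.ThreadMXBean;
> >>>>>>     import javax.management.JMX;
> >>>>>>     import javax.management.MBeanServerConnection;
> >>>>>>     import javax.management.ObjectName;
> >>>>>>     import javax.management.remote.JMXConnector;
> >>>>>>     import javax.management.remote.JMXConnectorFactory;
> >>>>>>     import javax.management.remote.JMXServiceURL;
> >>>>>>
> >>>>>>     public class SolrJvmPoller {
> >>>>>>         public static void main(String[] args) throws Exception {
> >>>>>>             // Placeholder JMX endpoint for one Solr instance.
> >>>>>>             JMXServiceURL url = new JMXServiceURL(
> >>>>>>                 "service:jmx:rmi:///jndi/rmi://solr-host:18983/jmxrmi");
> >>>>>>             JMXConnector connector = JMXConnectorFactory.connect(url);
> >>>>>>             try {
> >>>>>>                 MBeanServerConnection conn = connector.getMBeanServerConnection();
> >>>>>>                 // Thread count of the Solr JVM (metric #4).
> >>>>>>                 ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
> >>>>>>                     conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
> >>>>>>                 // Process-level CPU of the Solr JVM (metric #3), not system-wide.
> >>>>>>                 com.sun.management.OperatingSystemMXBean os = JMX.newMXBeanProxy(
> >>>>>>                     conn, new ObjectName("java.lang:type=OperatingSystem"),
> >>>>>>                     com.sun.management.OperatingSystemMXBean.class);
> >>>>>>                 System.out.println("threads=" + threads.getThreadCount()
> >>>>>>                     + " processCpu=" + os.getProcessCpuLoad());
> >>>>>>             } finally {
> >>>>>>                 connector.close();
> >>>>>>             }
> >>>>>>         }
> >>>>>>     }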
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim Vaillancourt
> >>>>>>
> >>>>>>
> >>>>>> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Did you ever get to index that long before without hitting the deadlock?
> >>>>>>>
> >>>>>>> There really isn't anything negative the patch could be introducing, other
> >>>>>>> than allowing for some more threads to possibly run at once. If I had to
> >>>>>>> guess, I would say it's likely this patch fixes the deadlock issue and
> >>>>>>> you're seeing another issue - which looks like the system cannot keep up
> >>>>>>> with the requests or something for some reason - perhaps due to some OS
> >>>>>>> networking settings or something (more guessing). Connection refused
> >>>>>>> happens generally when there is nothing listening on the port.
> >>>>>>>
> >>>>>>> Do you see anything interesting change with the rest of the system? CPU
> >>>>>>> usage spikes or something like that?
> >>>>>>>
> >>>>>>> Clamping down further on the overall number of threads might help (which
> >>>>>>> would require making something configurable). How many nodes are listed in
> >>>>>>> zk under live_nodes?
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>
> >>>>>>>  Hey guys,
> >>>>>>>>
> >>>>>>>> (copy of my post to SOLR-5216)
> >>>>>>>>
> >>>>>>>> We tested this patch and unfortunately encountered some serious issues
> >>>>>>>> after a few hours of 500 update-batches/sec. Our update batch is 10 docs,
> >>>>>>>> so we are writing about 5000 docs/sec total, using autoCommit to commit
> >>>>>>>> the updates (no explicit commits).
> >>>>>>>>
> >>>>>>>> Our environment:
> >>>>>>>>
> >>>>>>>>    Solr 4.3.1 w/SOLR-5216 patch.
> >>>>>>>>    Jetty 9, Java 1.7.
> >>>>>>>>    3 solr instances, 1 per physical server.
> >>>>>>>>    1 collection.
> >>>>>>>>    3 shards.
> >>>>>>>>    2 replicas (each instance is a leader and a replica).
> >>>>>>>>    Soft autoCommit is 1000ms.
> >>>>>>>>    Hard autoCommit is 15000ms.
> >>>>>>>>
> >>>>>>>> After about 6 hours of stress-testing this patch, we see many of these
> >>>>>>>> stalled transactions (below), and the Solr instances start to see each
> >>>>>>>> other as down, flooding our Solr logs with "Connection Refused" exceptions,
> >>>>>>>> and otherwise no obviously-useful logs that I could see.
> >>>>>>>>
> >>>>>>>> I did notice some stalled transactions on both /select and /update,
> >>>>>>>> however. This never occurred without this patch.
> >>>>>>>>
> >>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>>>>>>
> >>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour
> >>>>>>>> soak. My script "normalizes" the ERROR-severity stack traces and returns
> >>>>>>>> them in order of occurrence.
> >>>>>>>>
> >>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Tim Vaillancourt
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>> -----Original message-----
> >>>>>>>>>
> >>>>>>>>>> From: Erick Erickson <erickerickson@gmail.com>
> >>>>>>>>>> Sent: Friday 6th September 2013 16:20
> >>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>
> >>>>>>>>>> Markus:
> >>>>>>>>>>
> >>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> >>>>>>>>>> <markus.jel...@openindex.io> wrote:
> >>>>>>>>>>
> >>>>>>>>>>  Hi Mark,
> >>>>>>>>>>>
> >>>>>>>>>>> Got an issue to watch?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Markus
> >>>>>>>>>>>
> >>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Mark Miller <markrmil...@gmail.com>
> >>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what
> >>>>>>>>>>>> it is since early this year, but it's never personally been an issue,
> >>>>>>>>>>>> so it's rolled along for a long time.
> >>>>>>>>>>>
> >>>>>>>>>>>> Mark
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am looking into an issue we've been having with SolrCloud since the
> >>>>>>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> >>>>>>>>>>>>> 4.4.0 yet). I've noticed other users with this same issue, so I'd
> >>>>>>>>>>>>> really like to get to the bottom of it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> >>>>>>>>>>>>> we see stalled transactions that snowball to consume all Jetty threads
> >>>>>>>>>>>>> in the JVM. This eventually causes the JVM to hang with most threads
> >>>>>>>>>>>>> waiting on the condition/stack provided at the bottom of this message.
> >>>>>>>>>>>>> At this point SolrCloud instances then start to see their neighbors
> >>>>>>>>>>>>> (who also have all threads hung) as down w/"Connection Refused", and
> >>>>>>>>>>>>> the shards become "down" in state. Sometimes a node or two survives
> >>>>>>>>>>>>> and just returns 503s "no server hosting shard" errors.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of threads sending
> >>>>>>>>>>>>> updates to Solr, as well as the batch size (we batch updates from
> >>>>>>>>>>>>> client -> solr), and the Soft/Hard autoCommits, all to no avail. We also
> >>>>>>>>>>>>> tried turning off Client-to-Solr batching (1 update = 1 call to Solr),
> >>>>>>>>>>>>> which did not help. Certain combinations of update threads and batch
> >>>>>>>>>>>>> sizes seem to mask/help the problem, but not resolve it entirely.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Our current environment is the following:
> >>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
> >>>>>>>>>>>>>   and a replica of 1 shard).
> >>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
> >>>>>>>>>>>>>   good day.
> >>>>>>>>>>>>> - 5000 max jetty threads (well above what we use when we are healthy),
> >>>>>>>>>>>>>   Linux-user threads ulimit is 6000.
> >>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java
> >>>>>>>>>>>>>   version (I hope I'm wrong).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the
> >>>>>>>>>>>>> following, which seems to be waiting on a lock that I would very much
> >>>>>>>>>>>>> like to understand further:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>>>>>>   at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>>>>>>   - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>>>>>>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>>>>>>   at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>>>>>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>>>>>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>>>>>>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>>>>>>   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>>>>>>   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Some questions I had were:
> >>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>>>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could someone
> >>>>>>>>>>>>> help me understand "what" solr is locking in this case at
> >>>>>>>>>>>>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >>>>>>>>>>>>> when performing an update? That will help me understand where to look next.
> >>>>>>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8",
> >>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever they
> >>>>>>>>>>>>> are) on update, and thus more update throughput?
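> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For anyone else reading along, the general pattern behind a bounded
> >>>>>>>>>>>>> java.util.concurrent.Semaphore like the one in that frame looks roughly
> >>>>>>>>>>>>> like the sketch below (illustrative only, not Solr's actual code; the
> >>>>>>>>>>>>> permit count is made up). If every request-handling thread ends up parked
> >>>>>>>>>>>>> in acquire() while the requests holding the permits are themselves waiting
> >>>>>>>>>>>>> on those same threads, no permit is ever released, which seems to match
> >>>>>>>>>>>>> what we see:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     import java.util.concurrent.Semaphore;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     // Illustrative only - NOT Solr's actual code: a bounded semaphore
> >>>>>>>>>>>>>     // capping how many distributed requests may be in flight at once,
> >>>>>>>>>>>>>     // the same pattern as the AdjustableSemaphore frame in the trace.
> >>>>>>>>>>>>>     public class BoundedSubmitter {
> >>>>>>>>>>>>>         private final Semaphore permits = new Semaphore(16); // hypothetical limit
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         public void submit(Runnable request) throws InterruptedException {
> >>>>>>>>>>>>>             permits.acquire();      // parks the thread (WAITING) when no permits are left
> >>>>>>>>>>>>>             try {
> >>>>>>>>>>>>>                 request.run();      // stand-in for forwarding the update
> >>>>>>>>>>>>>             } finally {
> >>>>>>>>>>>>>                 permits.release();  // a permit only returns when a request completes
> >>>>>>>>>>>>>             }
> >>>>>>>>>>>>>         }
> >>>>>>>>>>>>>     }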
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at