Lol at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we "get by" for now in testing; patched Solr versions seem to make people uneasy sometimes :).
Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those two in combination will make a world of difference! Thanks so much again, guys!

Tim

On 12 September 2013 03:43, Erick Erickson <erickerick...@gmail.com> wrote:
> Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;).
>
> Adrien has branched the code for SOLR 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can rather than apply individual patches...
>
> I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut...
>
> FWIW,
> Erick
>
> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
> > Thanks Erick!
> >
> > Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well.
> >
> > I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance and I could go 1-to-1 to simplify things.
> >
> > The good news is we seem to be more stable since changing to a bigger client->solr batch size and fewer client threads updating.
> >
> > Cheers,
> >
> > Tim
> >
> > On 11/09/13 04:19 AM, Erick Erickson wrote:
> >> If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By "recent", I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem.
> >>
> >> I would expect that increasing the number of shards would make the problem worse, not better.
> >>
> >> There's also SOLR-5232...
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>> Hey guys,
> >>>
> >>> Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud:
> >>>
> >>> 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is that increasing the batch size reduces the likelihood of this issue happening.
> >>> 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud.
> >>> 3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads).
> >>>
> >>> To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand.
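For reference, a rough SolrJ sketch of the client-side pattern being discussed above (batches of ~200 documents pushed through CloudSolrServer, with commits left entirely to Solr's autoCommit). This is not the poster's actual application code; the ZooKeeper string, collection name, and field names are placeholders:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Illustrative sketch, assuming the 4.x SolrJ API: a single ZooKeeper-aware
    // client sends updates in batches of ~200 docs and issues no explicit commits.
    public class BatchedUpdater {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(200);
            for (int i = 0; i < 5000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("body_t", "example payload " + i);
                batch.add(doc);

                if (batch.size() == 200) {   // push a full batch; rely on autoCommit server-side
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);             // flush the final partial batch
            }
            solr.shutdown();
        }
    }

Because CloudSolrServer routes requests using the cluster state in ZooKeeper rather than a load-balanced VIP, it is also the natural place to pick up the SOLR-4816 behavior mentioned above.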
> >>>
> >>> If we happen to encounter issues with the above 3 changes, the next steps (which I could use some advice on) are:
> >>>
> >>> 1) Increase the number of shards (2x) - the theory here is that this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all?
> >>> 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go "direct" to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts?
> >>>
> >>> Thanks all!
> >>>
> >>> Cheers,
> >>>
> >>> Tim
> >>>
> >>> On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>> Enjoy your trip, Mark! Thanks again for the help!
> >>>>
> >>>> Tim
> >>>>
> >>>> On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>> Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>> Hey Mark,
> >>>>>>
> >>>>>> The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less.
> >>>>>>
> >>>>>> During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but I strangely see none of the "OutOfMemory: cannot open native thread" errors that always follow this. Weird!
> >>>>>>
> >>>>>> We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite "pinned" (very beefy Dell R720s - 16-core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).
> >>>>>>
> >>>>>> More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN severity unless something breaks).
> >>>>>>
> >>>>>> Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):
> >>>>>>
> >>>>>> "java.lang.IllegalStateException:
> >>>>>>     at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >>>>>>     at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >>>>>>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>     at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >>>>>>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >>>>>>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >>>>>>     at java.lang.Thread.run(Thread.java:724)"
> >>>>>>
> >>>>>> On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however.
> >>>>>> I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash?
> >>>>>>
> >>>>>> Metrics needed:
> >>>>>>
> >>>>>> 1) # of live_nodes.
> >>>>>> 2) Full stack traces.
> >>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >>>>>> 4) Solr's JVM thread count (already done).
> >>>>>> 5) ?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim Vaillancourt
> >>>>>>
> >>>>>> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>> Did you ever get to index that long before without hitting the deadlock?
> >>>>>>>
> >>>>>>> There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port.
> >>>>>>>
> >>>>>>> Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?
> >>>>>>>
> >>>>>>> Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes?
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>> Hey guys,
> >>>>>>>>
> >>>>>>>> (copy of my post to SOLR-5216)
> >>>>>>>>
> >>>>>>>> We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).
> >>>>>>>>
> >>>>>>>> Our environment:
> >>>>>>>>
> >>>>>>>> Solr 4.3.1 w/SOLR-5216 patch.
> >>>>>>>> Jetty 9, Java 1.7.
> >>>>>>>> 3 Solr instances, 1 per physical server.
> >>>>>>>> 1 collection.
> >>>>>>>> 3 shards.
> >>>>>>>> 2 replicas (each instance is a leader and a replica).
> >>>>>>>> Soft autoCommit is 1000ms.
> >>>>>>>> Hard autoCommit is 15000ms.
> >>>>>>>>
> >>>>>>>> After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions, and otherwise no obviously useful logs that I could see.
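As a concrete starting point for metrics 1 and 4 in the list above, a rough monitoring sketch could poll live_nodes through the same ZooKeeper cluster state SolrCloud uses, and poll a Solr JVM's thread count over remote JMX. This is illustrative only, not something from this thread: it assumes the Solr JVMs are started with remote JMX enabled (e.g. -Dcom.sun.management.jmxremote.port=...), and the ZooKeeper string, JMX host/port, and class name are placeholders:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    // Sketch: one-shot collection of "live_nodes count" and "Solr JVM thread count".
    public class ClusterWatch {
        public static void main(String[] args) throws Exception {
            // Metric 1: number of live_nodes, via SolrJ's view of the ZooKeeper cluster state.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.connect();
            int liveNodes = solr.getZkStateReader().getClusterState().getLiveNodes().size();
            System.out.println("live_nodes: " + liveNodes);
            solr.shutdown();

            // Metric 4: thread count of one Solr JVM, polled over remote JMX.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://solr-host-1:18983/jmxrmi");
            JMXConnector jmx = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbsc = jmx.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            System.out.println("solr jvm threads: " + threads.getThreadCount());
            jmx.close();
        }
    }

Run on a schedule (cron or a small loop), this would give the live_nodes and thread-count history that was missing when the crash above occurred; full stack traces and per-process CPU would still need jstack and OS-level sampling of the Solr PID.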
> >>>>>>>>
> >>>>>>>> I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.
> >>>>>>>>
> >>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>>>>>>
> >>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.
> >>>>>>>>
> >>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Tim Vaillancourt
> >>>>>>>>
> >>>>>>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>> -----Original message-----
> >>>>>>>>>> From: Erick Erickson <erickerick...@gmail.com>
> >>>>>>>>>> Sent: Friday 6th September 2013 16:20
> >>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>
> >>>>>>>>>> Markus:
> >>>>>>>>>>
> >>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>>>>> Hi Mark,
> >>>>>>>>>>>
> >>>>>>>>>>> Got an issue to watch?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Markus
> >>>>>>>>>>>
> >>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>> From: Mark Miller <markrmil...@gmail.com>
> >>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Mark
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM.
> >>>>>>>>>>>>> This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503 "no server hosting shard" errors.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the soft/hard autoCommits, all to no avail. We also tried turning off client-to-Solr batching (1 update = 1 call to Solr), which did not help either. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Our current environment is the following:
> >>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>>>>>>> - 3 x ZooKeeper instances, external Java 7 JVM.
> >>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
> >>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
> >>>>>>>>>>>>> - 5000 max Jetty threads (well above what we use when we are healthy); Linux user-threads ulimit is 6000.
> >>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>>>>>>     at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>>>>>>     - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>>>>>>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>>>>>>     at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>>>>>>     at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>>>>>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>>>>>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>>>>>>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>>>>>>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>>>>>>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Some questions I had were:
> >>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>>>>>>>>> 2) Keeping in mind I do not read or write Java (sorry :D), could someone help me understand "what" Solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next.
> >>>>>>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8"; is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>>>>>>> 5) Wild-ass theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at
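On questions 1-3 above: the stack trace itself shows that the threads are parked in java.util.concurrent.Semaphore.acquire(), reached through org.apache.solr.util.AdjustableSemaphore.acquire() from SolrCmdDistributor.submit(), so "0x00000007216e68d8" is the shared semaphore's sync object - every thread reports the same address because they are all waiting on the same semaphore. A stripped-down sketch of that bounded-submission pattern follows; it is illustrative only (the class name, cap, and methods are made up for this example, not Solr's actual code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Illustration of a bounded-submission pattern: a fixed pool of permits limits how
    // many forwarded requests may be in flight; callers block in acquire() until a
    // permit is free, which is the parked state seen in the stack trace above.
    public class BoundedSubmitSketch {
        private static final int MAX_OUTSTANDING = 16;              // hypothetical cap
        private final Semaphore permits = new Semaphore(MAX_OUTSTANDING);
        private final ExecutorService senders = Executors.newCachedThreadPool();

        public void submit(final Runnable forwardToReplica) throws InterruptedException {
            permits.acquire();               // request threads park here when no permits are left
            senders.execute(new Runnable() {
                public void run() {
                    try {
                        forwardToReplica.run();   // blocks until the remote node responds
                    } finally {
                        permits.release();        // permit only returned once the call completes
                    }
                }
            });
        }
    }

If the remote node's own request threads are themselves stuck in the same kind of acquire(), the in-flight calls never complete, permits are never released, and every node ends up parked on its semaphore - which is consistent with the cluster-wide stall and "Connection Refused" cascade described earlier in this thread.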