Just a little update on my concurrency issue.

The problem I was having was that under heavy load, individual Solr
instances would be slow to respond, eventually leading to flapping cluster
membership.

I tweaked a bunch of settings in Linux, Jetty, Solr, and my own application,
but in the end none of those changes prevented the stability issues I was
having.

Instead, I modified my HAProxy config to limit the maximum number of
simultaneous connections on a per-server basis. By capping the number of
simultaneous queries being handled by Solr at 30, I've effectively prevented
long-running queries from stacking up and getting continually slower.
Instead, HAProxy is now queueing up the pending requests and letting them
in whenever there's available capacity. As a result, Solr behaves normally
under intense load, and even though queries perform more slowly during these
times, it never results in runaway slowness.
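
For anyone curious, the relevant part of my haproxy.cfg looks roughly like
this (a simplified sketch; the backend name, server names, addresses and
timeout are placeholders, not a recommendation):

    backend solr_query
        balance roundrobin
        # queue requests at the proxy instead of letting them pile up on Solr
        timeout queue 30s
        server solr1 10.0.0.11:8983 check maxconn 30
        server solr2 10.0.0.12:8983 check maxconn 30

The per-server "maxconn 30" is what does the real work; anything beyond
that just waits in HAProxy's queue until a slot frees up.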

My best guess as to why I ran into this issue is that perhaps my query
volume was large relative to the on-disk index size. With the index sitting
almost entirely in the OS page cache, Solr spends almost no time waiting on
disk IO. This, perhaps, leaves the door open for query-driven CPU
utilization to cause more fundamental issues in Solr's performance...

Or maybe I missed something stupid at the OS level.

Sigh.

Many thanks for all the help!

-Dave

On Wed, Dec 28, 2016 at 7:11 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You'll see some lines with three different times in them: "user", "sys"
> and "real".
> The one that really counts is "real"; that's the time that the process was
> stopped while GC went on. The "stop" in "Stop the world" (STW) GC.
>
> What you're looking for is two things:
>
> 1> outrageously long times
> and/or
> 2> these happening one right after the other.
>
> For <2> I've seen situations where you go into a STW pause, collect
> a tiny bit of memory (say a few meg) and try to continue, only to go
> right back into another. It might take, say, 2 seconds of "real" time to
> do the GC, then go back into another 2-second cycle 500ms later. That
> kind of thing.
>
> GCViewer can help you make sense of the GC logs
> https://sourceforge.net/projects/gcviewer/
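>
> If you just want a quick look before loading the whole log into a tool,
> something like this pulls out the longest "real" pauses (an illustrative
> one-liner, assuming the log format you pasted; the exact fields depend on
> your GC flags):
>
>   grep -o "real=[0-9.]*" solr_gc.log | sort -t= -k2 -rn | head -20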
>
> Unfortunately GC tuning is "more art than science" ;(
>
> Best,
> Erick
>
> On Wed, Dec 28, 2016 at 10:57 AM, Dave Seltzer <dselt...@tveyes.com>
> wrote:
> > Hi Erick,
> >
> > You're probably right about it not being a threading issue. In general it
> > seems that CPU contention could indeed be the issue.
> >
> > Most of the settings we're using in Solr came "right out of the box"
> > including Jetty's configuration which specifies:
> >
> > solr.jetty.threads.min: 10
> > solr.jetty.threads.max: 10000
> > solr.jetty.threads.idle.timeout: 5000
> > solr.jetty.threads.stop.timeout: 60000
> >
> > The only interesting thing we're doing is disabling the query cache. This
> > is because individual hash-matching queries tend to be unique and
> > therefore
> > don't benefit significantly from query caching.
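> >
> > For context, that's the queryResultCache defined in solrconfig.xml; the
> > stock entry looks roughly like the snippet below, and disabling it just
> > amounts to removing or commenting out that element (values shown are the
> > shipped defaults):
> >
> >   <!--
> >   <queryResultCache class="solr.LRUCache"
> >                     size="512"
> >                     initialSize="512"
> >                     autowarmCount="0"/>
> >   -->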
> >
> > On the GC side, I'm not really sure what to look for. Here's an example
> > message from /solr/logs/solr_gc.log
> >
> > 2016-12-28T13:48:56.872-0500: 9453.890: Total time for which application
> > threads were stopped: 0.8394383 seconds, Stopping threads took: 0.0004007
> > seconds
> > {Heap before GC invocations=8169 (full 124):
> >  par new generation   total 3495296K, used 3495296K [0x00000003c0000000,
> > 0x00000004c0000000, 0x00000004c0000000)
> >   eden space 2796288K, 100% used [0x00000003c0000000, 0x000000046aac0000,
> > 0x000000046aac0000)
> >   from space 699008K, 100% used [0x0000000495560000, 0x00000004c0000000,
> > 0x00000004c0000000)
> >   to   space 699008K,   0% used [0x000000046aac0000, 0x000000046aac0000,
> > 0x0000000495560000)
> >  concurrent mark-sweep generation total 12582912K, used 12111153K
> > [0x00000004c0000000, 0x00000007c0000000, 0x00000007c0000000)
> >  Metaspace       used 33470K, capacity 33998K, committed 34360K, reserved
> > 1079296K
> >   class space    used 3716K, capacity 3888K, committed 3960K, reserved
> > 1048576K
> > 2016-12-28T13:48:57.415-0500: 9454.434: [GC (Allocation Failure)
> > 2016-12-28T13:48:57.415-0500: 9454.434: [ParNew
> > Desired survivor size 644205768 bytes, new threshold 3 (max 8)
> > - age   1:  284566200 bytes,  284566200 total
> > - age   2:  197448288 bytes,  482014488 total
> > - age   3:  168306328 bytes,  650320816 total
> > - age   4:   48423744 bytes,  698744560 total
> > - age   5:   17038920 bytes,  715783480 total
> > : 3495296K->699008K(3495296K), 1.2399730 secs]
> > 15606449K->13188910K(16078208K), 1.2403791 secs] [Times: user=4.60
> > sys=0.00, real=1.24 secs]
> >
> > Is there something I should be grepping for in this enormous file?
> >
> > Many thanks!
> >
> > -Dave
> >
> > On Wed, Dec 28, 2016 at 12:44 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Threads are usually a container parameter, I think. True, Solr wants
> >> lots of threads. My return volley would be: how busy is your CPU when
> >> this happens? If it's pegged, more threads probably aren't really going
> >> to help. And if it's a GC issue, then more threads would probably hurt.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Dec 28, 2016 at 9:14 AM, Dave Seltzer <dselt...@tveyes.com>
> >> wrote:
> >> > Hi Erick,
> >> >
> >> > I'll dig in on these timeout settings and see how changes affect
> >> > behavior.
> >> >
> >> > One interesting aspect is that we're not indexing any content at the
> >> > moment. The rate of ingress is something like 10 to 20 documents per
> >> > day.
> >> >
> >> > So my guess is that ZK simply is deciding that these servers are dead
> >> > based
> >> > on the fact that responses are so very sluggish.
> >> >
> >> > You've mentioned lots of timeouts, but are there any settings which
> >> > control
> >> > the number of available threads? Or is this something which is largely
> >> > handled automagically?
> >> >
> >> > Many thanks!
> >> >
> >> > -Dave
> >> >
> >> > On Wed, Dec 28, 2016 at 11:56 AM, Erick Erickson <erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> Dave:
> >> >>
> >> >> There are at least 4 timeouts (not even including ZK) that can
> >> >> be relevant, defined in solr.xml:
> >> >> socketTimeout
> >> >> connTimeout
> >> >> distribUpdateConnTimeout
> >> >> distribUpdateSoTimeout
> >> >>
> >> >> Plus the ZK timeout
> >> >> zkClientTimeout
> >> >>
> >> >> Plus the ZK configurations.
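> >> >>
> >> >> For reference, those live in solr.xml roughly like this (a sketch of
> >> >> the stock file; the values are the usual defaults, not recommendations):
> >> >>
> >> >>   <solr>
> >> >>     <solrcloud>
> >> >>       <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
> >> >>       <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
> >> >>       <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
> >> >>     </solrcloud>
> >> >>     <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
> >> >>       <int name="socketTimeout">${socketTimeout:600000}</int>
> >> >>       <int name="connTimeout">${connTimeout:60000}</int>
> >> >>     </shardHandlerFactory>
> >> >>   </solr>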
> >> >>
> >> >> So it would help narrow down what's going on if we knew why the nodes
> >> >> dropped out. There are indeed a lot of messages dumped, but somewhere
> >> >> in the logs there should be a root cause.
> >> >>
> >> >> You might see Leader Initiated Recovery (LIR), which can indicate that
> >> >> an update operation from the leader took too long; the timeouts above
> >> >> can be adjusted in this case.
> >> >>
> >> >> You might see evidence that ZK couldn't get a response from Solr in
> >> >> "too long" and decided it was gone.
> >> >>
> >> >> You might see...
> >> >>
> >> >> One thing I'd look at very closely is GC processing. One of the
> >> >> culprits for this behavior I've seen is a very long GC stop-the-world
> >> >> pause leading to ZK thinking the node is dead and tripping this
> >> >> chain.
> >> >> Depending on the timeouts, "very long" might be a few seconds.
> >> >>
> >> >> Not entirely helpful, but until you pinpoint why the node goes into
> >> >> recovery it's throwing darts at the wall. GC and log messages might
> >> >> give some insight into the root cause.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Wed, Dec 28, 2016 at 8:26 AM, Dave Seltzer <dselt...@tveyes.com>
> >> >> wrote:
> >> >> > Hello Everyone,
> >> >> >
> >> >> > I'm working on a Solr Cloud cluster which is used in a hash matching
> >> >> > application.
> >> >> >
> >> >> > For performance reasons we've opted to batch-execute hash matching
> >> >> > queries. This means that a single query will contain many nested
> >> >> > queries. As you might expect, these queries take a while to execute.
> >> >> > (On the order of 5 to 10 seconds.)
> >> >> >
> >> >> > I've noticed that Solr will act erratically when we send too many
> >> >> > long-running queries. Specifically, heavily-loaded servers will
> >> >> > repeatedly fall out of the cluster and then recover. My theory is that
> >> >> > there's some limit on the number of concurrent connections and that
> >> >> > client queries are preventing zookeeper related queries... but I'm not
> >> >> > sure. I've increased ZKClientTimeout to combat this.
> >> >> >
> >> >> > My question is: What configuration settings should I be looking at in
> >> >> > order to make sure I'm maximizing the ability of Solr to handle
> >> >> > concurrent requests?
> >> >> >
> >> >> > Many thanks!
> >> >> >
> >> >> > -Dave
> >> >>
> >>
>
