Hi Erick,

You're probably right that this isn't a threading issue; CPU contention does
seem like the more likely culprit.
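
The next time a node drops out of the cluster I'll try to grab per-thread
CPU for the Solr process while it's happening. A rough sketch of what I
have in mind (SOLR_PID is just a placeholder for the actual Solr process id):

  # Snapshot per-thread CPU for the Solr JVM while the problem is happening.
  # -H lists individual threads; -b -n 1 takes a single batch-mode sample.
  top -b -H -n 1 -p $SOLR_PID | head -40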

Most of the settings we're using in Solr came "right out of the box,"
including Jetty's configuration, which specifies:

solr.jetty.threads.min: 10
solr.jetty.threads.max: 10000
solr.jetty.threads.idle.timeout: 5000
solr.jetty.threads.stop.timeout: 60000
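
If it would help, I can also count how many Jetty worker threads are
actually live; something like the following should work, assuming Jetty's
usual "qtp" thread-name prefix for its QueuedThreadPool (SOLR_PID again
being a placeholder):

  # Count live Jetty worker (QueuedThreadPool) threads in the Solr JVM.
  jstack $SOLR_PID | grep -c 'qtp'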

The only interesting thing we're doing is disabling the query cache. This
is because individual hash-matching queries tend to be unique and therefore
don't benefit significantly from query caching.
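
(If it's useful, here's roughly how I'd double-check what the cache settings
actually look like on disk; the install path and core name are placeholders
for our real layout:)

  # Show the cache settings the core is actually running with.
  # Path and core name ("mycore") are placeholders.
  grep -A2 -E 'queryResultCache|filterCache|documentCache' \
    /opt/solr/server/solr/mycore/conf/solrconfig.xml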

On the GC side, I'm not really sure what to look for. Here's an example entry
from /solr/logs/solr_gc.log:

2016-12-28T13:48:56.872-0500: 9453.890: Total time for which application threads were stopped: 0.8394383 seconds, Stopping threads took: 0.0004007 seconds
{Heap before GC invocations=8169 (full 124):
 par new generation   total 3495296K, used 3495296K [0x00000003c0000000, 0x00000004c0000000, 0x00000004c0000000)
  eden space 2796288K, 100% used [0x00000003c0000000, 0x000000046aac0000, 0x000000046aac0000)
  from space 699008K, 100% used [0x0000000495560000, 0x00000004c0000000, 0x00000004c0000000)
  to   space 699008K,   0% used [0x000000046aac0000, 0x000000046aac0000, 0x0000000495560000)
 concurrent mark-sweep generation total 12582912K, used 12111153K [0x00000004c0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 33470K, capacity 33998K, committed 34360K, reserved 1079296K
  class space    used 3716K, capacity 3888K, committed 3960K, reserved 1048576K
2016-12-28T13:48:57.415-0500: 9454.434: [GC (Allocation Failure) 2016-12-28T13:48:57.415-0500: 9454.434: [ParNew
Desired survivor size 644205768 bytes, new threshold 3 (max 8)
- age   1:  284566200 bytes,  284566200 total
- age   2:  197448288 bytes,  482014488 total
- age   3:  168306328 bytes,  650320816 total
- age   4:   48423744 bytes,  698744560 total
- age   5:   17038920 bytes,  715783480 total
: 3495296K->699008K(3495296K), 1.2399730 secs] 15606449K->13188910K(16078208K), 1.2403791 secs] [Times: user=4.60 sys=0.00, real=1.24 secs]

Is there something I should be grepping for in this enormous file?
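
The only idea I've had so far is to pull out the longest stop-the-world
pauses and check whether CMS ever hits a concurrent mode failure, along
these lines (field 11 is simply where the pause duration lands in the
sample line above):

  # Top 20 longest "application threads were stopped" pauses, in seconds.
  grep 'Total time for which application threads were stopped' solr_gc.log \
    | awk '{print $11}' | sort -rn | head -20

  # Any concurrent mode failures would point at CMS falling behind.
  grep -c 'concurrent mode failure' solr_gc.log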

Many thanks!

-Dave

On Wed, Dec 28, 2016 at 12:44 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Threads are usually a container parameter, I think. True, Solr wants
> lots of threads. My return volley would be: how busy is your CPU when
> this happens? If it's pegged, more threads probably aren't really going
> to help. And if it's a GC issue, then more threads would probably hurt.
>
> Best,
> Erick
>
> On Wed, Dec 28, 2016 at 9:14 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> > Hi Erick,
> >
> > I'll dig in on these timeout settings and see how changes affect behavior.
> >
> > One interesting aspect is that we're barely indexing any content at the
> > moment. The rate of ingress is something like 10 to 20 documents per day.
> >
> > So my guess is that ZK is simply deciding that these servers are dead
> > based on the fact that responses are so very sluggish.
> >
> > You've mentioned lots of timeouts, but are there any settings which
> > control the number of available threads? Or is this something which is
> > largely handled automagically?
> >
> > Many thanks!
> >
> > -Dave
> >
> > On Wed, Dec 28, 2016 at 11:56 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Dave:
> >>
> >> There are at least 4 timeouts (not even including ZK) that can
> >> be relevant, defined in solr.xml:
> >> socketTimeout
> >> connTimeout
> >> distribUpdateConnTimeout
> >> distribUpdateSoTimeout
> >>
> >> Plus the ZK timeout
> >> zkClientTimeout
> >>
> >> Plus the ZK configurations.
> >>
> >> So it would help narrow down what's going on if we knew why the nodes
> >> dropped out. There are indeed a lot of messages dumped, but somewhere
> >> in the logs there should be a root cause.
> >>
> >> You might see Leader Initiated Recovery (LIR), which can indicate that
> >> an update operation from the leader took too long; the timeouts above
> >> can be adjusted in this case.
> >>
> >> You might see evidence that ZK couldn't get a response from Solr in
> >> "too long" and decided it was gone.
> >>
> >> You might see...
> >>
> >> One thing I'd look at very closely is GC processing. One of the
> >> culprits for this behavior I've seen is a very long GC stop-the-world
> >> pause leading to ZK thinking the node is dead and tripping this chain.
> >> Depending on the timeouts, "very long" might be a few seconds.
> >>
> >> Not entirely helpful, but until you pinpoint why the node goes into
> >> recovery, it's all throwing darts at the wall. GC and log messages might
> >> give some insight into the root cause.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Dec 28, 2016 at 8:26 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> >> > Hello Everyone,
> >> >
> >> > I'm working on a SolrCloud cluster which is used in a hash matching
> >> > application.
> >> >
> >> > For performance reasons we've opted to batch-execute hash matching
> >> > queries. This means that a single query will contain many nested
> >> > queries. As you might expect, these queries take a while to execute.
> >> > (On the order of 5 to 10 seconds.)
> >> >
> >> > I've noticed that Solr will act erratically when we send too many
> >> > long-running queries. Specifically, heavily loaded servers will
> >> > repeatedly fall out of the cluster and then recover. My theory is that
> >> > there's some limit on the number of concurrent connections and that
> >> > client queries are blocking ZooKeeper-related requests... but I'm not
> >> > sure. I've increased zkClientTimeout to combat this.
> >> >
> >> > My question is: what configuration settings should I be looking at in
> >> > order to make sure I'm maximizing Solr's ability to handle concurrent
> >> > requests?
> >> >
> >> > Many thanks!
> >> >
> >> > -Dave
> >>
>
