Threads are usually a container parameter, I think. True, Solr wants
lots of threads. My return volley would be: how busy is your CPU when
this happens? If it's pegged, more threads probably aren't going
to help. And if it's a GC issue, more threads would probably hurt.
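
For what it's worth, the container thread pool lives in the jetty.xml
that ships with Solr (server/etc/jetty.xml). From memory it looks
roughly like the snippet below (property names and defaults can vary
by version, so treat this as a sketch and check your own copy):

  <Get name="ThreadPool">
    <Set name="minThreads" type="int"><Property name="solr.jetty.threads.min" default="10"/></Set>
    <Set name="maxThreads" type="int"><Property name="solr.jetty.threads.max" default="10000"/></Set>
    <Set name="idleTimeout" type="int"><Property name="solr.jetty.threads.idle.timeout" default="5000"/></Set>
  </Get>

With maxThreads defaulting that high, the container itself is rarely
the bottleneck, which is why I'd rule out CPU and GC first.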

Best,
Erick

On Wed, Dec 28, 2016 at 9:14 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> Hi Erick,
>
> I'll dig in on these timeout settings and see how changes affect behavior.
>
> One interesting aspect is that we're not indexing any content at the
> moment. The rate of ingress is something like 10 to 20 documents per day.
>
> So my guess is that ZK is simply deciding that these servers are dead
> because responses are so sluggish.
>
> You've mentioned lots of timeouts, but are there any settings which control
> the number of available threads? Or is this something which is largely
> handled automagically?
>
> Many thanks!
>
> -Dave
>
> On Wed, Dec 28, 2016 at 11:56 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Dave:
>>
>> There are at least 4 timeouts (not even including ZK) that can
>> be relevant, defined in solr.xml:
>> socketTimeout
>> connTimeout
>> distribUpdateConnTimeout
>> distribUpdateSoTimeout
>>
>> Plus the ZK timeout
>> zkClientTimeout
>>
>> Plus the ZK configurations.
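>>
>> For reference, in a stock solr.xml those settings live in the
>> <solrcloud> and <shardHandlerFactory> sections and look something
>> like this (the numbers are just illustrative placeholders, not
>> recommendations for your case):
>>
>>   <solrcloud>
>>     <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
>>     <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
>>     <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
>>   </solrcloud>
>>
>>   <shardHandlerFactory name="shardHandlerFactory"
>>                        class="HttpShardHandlerFactory">
>>     <int name="socketTimeout">${socketTimeout:600000}</int>
>>     <int name="connTimeout">${connTimeout:60000}</int>
>>   </shardHandlerFactory>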
>>
>> So it would help narrow down what's going on if we knew why the nodes
>> dropped out. There are indeed a lot of messages dumped, but somewhere
>> in the logs there should be a root cause.
>>
>> You might see Leader Initiated Recovery (LIR), which can indicate that
>> an update operation from the leader took too long; the timeouts above
>> can be adjusted in that case.
>>
>> You might see evidence that ZK couldn't get a response from Solr in
>> "too long" and decided it was gone.
>>
>> You might see...
>>
>> One thing I'd look at very closely is GC processing. One culprit I've
>> seen for this behavior is a very long GC stop-the-world pause that
>> leads ZK to think the node is dead, tripping this whole chain.
>> Depending on the timeouts, "very long" might be a few seconds.
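>>
>> If GC logging is enabled (recent Solr start scripts turn it on by
>> default and write somewhere like server/logs/solr_gc.log, though the
>> exact path depends on your install), and your flags include
>> -XX:+PrintGCApplicationStoppedTime, you can spot the worst pauses
>> with something like:
>>
>>   grep "Total time for which application threads were stopped" \
>>       server/logs/solr_gc.log
>>
>> and compare the longest pauses against zkClientTimeout.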
>>
>> Not entirely helpful, but until you pinpoint why the node goes into
>> recovery, it's like throwing darts at the wall. GC and Solr log
>> messages might give some insight into the root cause.
>>
>> Best,
>> Erick
>>
>> On Wed, Dec 28, 2016 at 8:26 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
>> > Hello Everyone,
>> >
>> > I'm working on a Solr Cloud cluster which is used in a hash matching
>> > application.
>> >
>> > For performance reasons we've opted to batch-execute hash matching
>> > queries. This means that a single query will contain many nested
>> > queries. As you might expect, these queries take a while to execute.
>> > (On the order of 5 to 10 seconds.)
>> >
>> > I've noticed that Solr will act erratically when we send too many
>> > long-running queries. Specifically, heavily loaded servers will
>> > repeatedly fall out of the cluster and then recover. My theory is that
>> > there's some limit on the number of concurrent connections and that
>> > client queries are crowding out ZooKeeper-related requests... but I'm
>> > not sure. I've increased zkClientTimeout to combat this.
>> >
>> > My question is: what configuration settings should I be looking at
>> > in order to make sure I'm maximizing the ability of Solr to handle
>> > concurrent requests?
>> >
>> > Many thanks!
>> >
>> > -Dave
>>
