Shawn,

Thanks for pointing me in the right direction. After consulting the
above document, I *think* the problem may be an oversized heap, which
could be causing long GC pauses and hence the ZK timeouts.
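
To confirm that theory, one option (a sketch, not something we have
in place yet -- the log path is just an example) would be to turn on
GC logging and compare pause lengths against the 15-second zookeeper
client timeout:

```
# Hypothetical JVM flags, assuming a Java 7 HotSpot JVM.
# PrintGCApplicationStoppedTime records the actual stop-the-world
# pause durations, which is what matters for ZK session timeouts.
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/solr/gc.log
```

If any "Total time for which application threads were stopped" entries
approach the ZK timeout, that would pretty much settle it.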

We have around 20G of memory on these machines, with heap min/max set
to 6G and 10G respectively (-Xms6G -Xmx10G). The rest was set aside
for disk cache. Why did we choose 6-10? No reason other than we wanted
to allot enough for disk cache, and everything else was thrown at
Solr. Does this sound about right?
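
For what it's worth, here is the direction we were considering -- a
sketch only, and the CMS values are generic starting points rather
than anything we've measured. With a 7-8G index, a smaller fixed-size
heap leaves most of the index in the OS disk cache:

```
# Sketch, assuming ~20G RAM and a 7-8G index.
# Equal -Xms/-Xmx avoids heap-resize pauses; ~6G heap leaves
# roughly 12-13G for the OS page cache.
-Xms6g
-Xmx6g
# CMS collector flags commonly suggested for Solr on Java 7,
# to keep old-gen collections concurrent rather than stop-the-world:
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
```

Happy to hear if those numbers are off base for our workload.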

I took some screenshots of VisualVM and our NewRelic reporting, as
well as some relevant portions of our solrconfig.xml. Any
thoughts/comments would be greatly appreciated.

http://postimg.org/gallery/4t73sdks/1fc10f9c/

Thanks




On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 3/22/2014 1:23 PM, Software Dev wrote:
>> We have 2 collections with 1 shard each replicated over 5 servers in the
>> cluster. We see a lot of flapping (down or recovering) on one of the
>> collections. When this happens the other collection hosted on the same
>> machine is still marked as active. When this happens it takes a fairly long
>> time (~30 minutes) for the collection to come back online, if at all. I
>> find that its usually more reliable to completely shutdown solr on the
>> affected machine and bring it back up with its core disabled. We then
>> re-enable the core when its marked as active.
>>
>> A few questions:
>>
>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing
>> that marks one collection as down but the other on the same machine as up?
>>
>> 2) Why does recovery take forever when a node goes down.. even if its only
>> down for 30 seconds. Our index is only 7-8G and we are running on SSD's.
>>
>> 3) What can be done to diagnose and fix this problem?
>
> Unless you are actually using the ping request handler, the healthcheck
> config will not matter.  Or were you referring to something else?
>
> Referencing the logs you included in your reply:  The EofException
> errors happen because your client code times out and disconnects before
> the request it made has completed.  That is most likely just a symptom
> that has nothing at all to do with the problem.
>
> Read the following wiki page.  What I'm going to say below will
> reference information you can find there:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Relevant side note: The default zookeeper client timeout is 15 seconds.
>  A typical zookeeper config defines tickTime as 2 seconds, and the
> timeout cannot be configured to be more than 20 times the tickTime,
> which means it cannot go beyond 40 seconds.  The default timeout value
> 15 seconds is usually more than enough, unless you are having
> performance problems.
>
> If you are not actually taking Solr instances down, then the fact that
> you are seeing the log replay messages indicates to me that something is
> taking so much time that the connection to Zookeeper times out.  When it
> finally responds, it will attempt to recover the index, which means
> first it will replay the transaction log and then it might replicate the
> index from the shard leader.
>
> Replaying the transaction log is likely the reason it takes so long to
> recover.  The wiki page I linked above has a "slow startup" section that
> explains how to fix this.
>
> There is some kind of underlying problem that is causing the zookeeper
> connection to timeout.  It is most likely garbage collection pauses or
> insufficient RAM to cache the index, possibly both.
>
> You did not indicate how much total RAM you have or how big your Java
> heap is.  As the wiki page mentions in the SSD section, SSD is not a
> substitute for having enough RAM to cache a significant percentage of
> your index.
>
> Thanks,
> Shawn
>
