Sorry, I should also mention that these leader nodes which are marked as
down can actually still be queried locally with distrib=false with no
problems. Is it possible that they've somehow got themselves out of sync?
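For reference, the kind of local query I mean is something like this (host
and collection names are placeholders):

    curl 'http://solr01:8983/solr/collection1/select?q=*:*&distrib=false'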
On 22 July 2013 13:37, Neil Prosser <neil.pros...@gmail.com> wrote:
> No need to apologise. It's always good to have things like that reiterated
> in case I've misunderstood along the way.
>
> I have a feeling that it's related to garbage collection. I assume that if
> the JVM heads into a stop-the-world GC, Solr can't let ZooKeeper know it's
> still alive and so gets marked as down. I've just taken a look at the GC
> logs and can see a couple of full collections which took longer than my ZK
> timeout of 15s. I'm still in the process of tuning the cache sizes and
> have probably got it wrong (I'm coming from a Solr instance which runs on
> a 48G heap with ~40m documents and bringing it into five shards with 8G
> heaps). I thought I was being conservative with the cache sizes but I
> should probably drop them right down and start again. The entire index is
> cached by Linux so I should just need caches to help with things which eat
> CPU at request time.
>
> The indexing level is unusual because normally we wouldn't be indexing
> everything sequentially, just making delta updates to the index as things
> are changed in our MoR. However, it's handy to know how it reacts under
> the most extreme load we could give it.
>
> In the case that I set my hard commit time to 15-30 seconds with
> openSearcher set to false, how do I control when I actually do invalidate
> the caches and open a new searcher? Is this something that Solr can do
> automatically, or will I need some sort of coordinator process to perform
> a 'proper' commit from outside Solr?
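> I assume the hard-commit side would look something like this in
> solrconfig.xml (times purely illustrative), with autoSoftCommit presumably
> being the knob that controls when searchers are opened:
>
>     <autoCommit>
>       <maxTime>15000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>       <maxTime>60000</maxTime>
>     </autoSoftCommit>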
> In our case the process of opening a new searcher is definitely a hefty
> operation. We have a large number of boosts and filters which are used for
> just about every query that is made against the index, so we currently
> have them warmed, which can take upwards of a minute on our giant core.
>
> Thanks for your help.
>
>
> On 22 July 2013 13:00, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Wow, you really shouldn't be having nodes go up and down so
>> frequently; that's a big red flag. That said, SolrCloud should be
>> pretty robust, so this is something to pursue...
>>
>> But even a 5 minute hard commit can lead to a hefty transaction
>> log under load, so you may want to reduce it substantially depending
>> on how fast you are sending docs to the index. I'm talking
>> 15-30 seconds here. It's critical that openSearcher be set to false
>> or you'll invalidate your caches that often. All a hard commit
>> with openSearcher set to false does is close off the current segment
>> and open a new one. It does NOT open/warm new searchers etc.
>>
>> The soft commits control visibility, so that's how you control
>> whether you can search the docs or not. Pardon me if I'm
>> repeating stuff you already know!
>>
>> As far as your nodes coming and going, I've seen some people have
>> good results by upping the ZooKeeper timeout limit.
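>> With the stock solr.xml that means something like (the value is
>> illustrative, you'll have to experiment):
>>
>>     zkClientTimeout="${zkClientTimeout:30000}"
>>
>> on the <cores> element, or starting Solr with -DzkClientTimeout=30000
>> if your solr.xml already references that system property.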
>> So I guess my first question is whether the nodes are actually going out
>> of service or whether it's just a timeout issue....
>>
>> Good luck!
>> Erick
>>
>> On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser <neil.pros...@gmail.com>
>> wrote:
>> > Very true. I was impatient (I think less than three minutes impatient,
>> > so hopefully 4.4 will save me from myself) but I didn't realise it was
>> > doing something rather than just hanging. Next time I have to restart
>> > a node I'll just leave and go get a cup of coffee or something.
>> >
>> > My configuration is set to auto hard-commit every 5 minutes. No auto
>> > soft-commit time is set.
>> >
>> > Over the course of the weekend, while left unattended, the nodes have
>> > been going up and down (I've got to solve the issue that is causing
>> > them to come and go, but any suggestions on what is likely to be
>> > causing something like that are welcome), and at one point one of the
>> > nodes stopped taking updates. After indexing properly for a few hours
>> > with that one shard not accepting updates, the replica of that shard
>> > which contained all the correct documents must have replicated from
>> > the broken node and dropped documents. Is there any protection against
>> > this in Solr or should I be focusing on getting my nodes to be more
>> > reliable? I've now got a situation where four of my five shards have
>> > leaders which are marked as down and followers which are up.
>> >
>> > I'm going to start grabbing information about the cluster state so I
>> > can track which changes are happening and in what order. I can get
>> > hold of Solr logs and garbage collection logs while these things are
>> > happening.
>> >
>> > Is this all just down to my nodes being unreliable?
>> >
>> >
>> > On 21 July 2013 13:52, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> Well, if I'm reading this right you had a node go out of circulation
>> >> and then bounced nodes until that node became the leader. So of
>> >> course it wouldn't have the documents (how could it?). Basically you
>> >> shot yourself in the foot.
>> >>
>> >> The underlying question is why the machine you were restarting took
>> >> so long to come up that you got impatient and started killing nodes.
>> >> There has been quite a bit done to make that process better, so what
>> >> version of Solr are you using? 4.4 is being voted on right now, so
>> >> you might want to consider upgrading.
>> >>
>> >> There was, for instance, a situation where it would take 3 minutes
>> >> for machines to start up. How impatient were you?
>> >>
>> >> Also, what are your hard commit parameters? All of the documents
>> >> you're indexing will be in the transaction log between hard commits,
>> >> and when a node comes up the leader will replay everything in the
>> >> tlog to the new node, which might be a source of why it took so long
>> >> for the new node to come back up. At the very least the new node you
>> >> were bringing back online will need to do a full index replication
>> >> (old style) to get caught up.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com>
>> >> wrote:
>> >> > While indexing some documents to a SolrCloud cluster (10 machines,
>> >> > 5 shards and 2 replicas, so one replica on each machine) one of the
>> >> > replicas stopped receiving documents, while the other replica of
>> >> > the shard continued to grow.
>> >> >
>> >> > That was overnight so I was unable to track exactly what happened
>> >> > (I'm going off our Graphite graphs here). This morning when I was
>> >> > able to look at the cluster both replicas of that shard were marked
>> >> > as down (with one marked as leader). I attempted to restart the
>> >> > non-leader node but it took a long time to restart, so I killed it
>> >> > and restarted the old leader, which also took a long time. I killed
>> >> > that one (I'm impatient) and left the non-leader node to restart,
>> >> > not realising it was missing approximately 700k documents that the
>> >> > old leader had. Eventually it restarted and became leader. I
>> >> > restarted the old leader and it dropped the number of documents it
>> >> > had to match the previous non-leader.
>> >> >
>> >> > Is this expected behaviour when a replica with fewer documents is
>> >> > started before the other and elected leader? Should I have been
>> >> > paying more attention to the number of documents on each server
>> >> > before restarting nodes?
>> >> >
>> >> > I am still in the process of tuning the caches and warming for
>> >> > these servers, but we are putting some load through the cluster so
>> >> > it is possible that the nodes are having to work quite hard when a
>> >> > new version of the core is made available. Is this likely to
>> >> > explain why I occasionally see nodes dropping out? Unfortunately in
>> >> > restarting the nodes I lost the GC logs to see whether that was
>> >> > likely to be the culprit. Is this the sort of situation where you
>> >> > raise the ZooKeeper timeout a bit? Currently the timeout for all
>> >> > nodes is 15 seconds.
>> >> >
>> >> > Are there any known issues which might explain what's happening?
>> >> > I'm just getting started with SolrCloud after using standard
>> >> > master/slave replication for an index which has got too big for one
>> >> > machine over the last few months.
>> >> >
>> >> > Also, is there any particular information that would help with
>> >> > these issues if this should happen again?
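>> >> >
>> >> > For reference, the GC logging flags on these nodes are roughly the
>> >> > following (the log path is illustrative):
>> >> >
>> >> >     -verbose:gc
>> >> >     -XX:+PrintGCDetails
>> >> >     -XX:+PrintGCDateStamps
>> >> >     -XX:+PrintGCApplicationStoppedTime
>> >> >     -Xloggc:/var/log/solr/gc.log
>> >> >
>> >> > (One gotcha: -Xloggc truncates its file when the JVM restarts,
>> >> > which is how I managed to lose the logs.)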