Wow, you really shouldn't be having nodes go up and down so
frequently; that's a big red flag. That said, SolrCloud should be
pretty robust, so this is something to pursue...

But even a 5-minute hard commit can lead to a hefty transaction
log under load; you may want to reduce it substantially depending
on how fast you are sending docs to the index. I'm talking
15-30 seconds here. It's critical that openSearcher be set to false,
or you'll be invalidating your caches that often. All a hard commit
with openSearcher set to false does is close off the current segment
and open a new one. It does NOT open/warm new searchers etc.
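
For concreteness, here's a solrconfig.xml sketch of what I mean (the
15-second value is just illustrative, tune it to your indexing rate):

  <autoCommit>
    <maxTime>15000</maxTime>            <!-- hard commit every 15 seconds -->
    <openSearcher>false</openSearcher>  <!-- don't open/warm a new searcher -->
  </autoCommit>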

Soft commits control visibility, so that's how you control when
docs become searchable. Pardon me if I'm repeating stuff you
already know!
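
If you do want docs to become visible on a schedule, that's the
autoSoftCommit section; something like this (the one-minute value is
just an example, pick whatever visibility lag you can live with):

  <autoSoftCommit>
    <maxTime>60000</maxTime>    <!-- docs searchable within a minute -->
  </autoSoftCommit>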

As for your nodes coming and going, I've seen people get good
results by upping the ZooKeeper timeout. So I guess my first
question is whether the nodes are actually going out of service
or whether it's just a timeout issue....
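
For reference, the setting I'm talking about is zkClientTimeout.
Assuming you're on the usual 4.x solr.xml layout, it's an attribute
on the <cores> element, e.g.

  zkClientTimeout="${zkClientTimeout:30000}"

and if your solr.xml keeps that property substitution you can just
start Solr with -DzkClientTimeout=30000 instead. Treat 30 seconds as
a starting point to experiment with, not a recommendation.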

Good luck!
Erick

On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> Very true. I was impatient (I think less than three minutes impatient so
> hopefully 4.4 will save me from myself) but I didn't realise it was doing
> something rather than just hanging. Next time I have to restart a node I'll
> just leave and go get a cup of coffee or something.
>
> My configuration is set to auto hard-commit every 5 minutes. No auto
> soft-commit time is set.
>
> Over the course of the weekend, while left unattended, the nodes have been
> going up and down (I've got to solve the issue that is causing them to come
> and go, but any suggestions on what is likely to be causing something like
> that are welcome), and at one point one of the nodes stopped taking updates.
> After indexing properly for a few hours with that one shard not accepting
> updates, the replica of that shard which contains all the correct documents
> must have replicated from the broken node and dropped documents. Is there
> any protection against this in Solr or should I be focusing on getting my
> nodes to be more reliable? I've now got a situation where four of my five
> shards have leaders who are marked as down and followers who are up.
>
> I'm going to start grabbing information about the cluster state so I can
> track which changes are happening and in what order. I can get hold of Solr
> logs and garbage collection logs while these things are happening.
>
> Is this all just down to my nodes being unreliable?
>
>
> On 21 July 2013 13:52, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Well, if I'm reading this right you had a node go out of circulation
>> and then bounced nodes until that node became the leader. So of course
>> it wouldn't have the documents (how could it?). Basically you shot
>> yourself in the foot.
>>
>> The underlying question here is why it took the machine you were
>> restarting so long to come up that you got impatient and started
>> killing nodes. There has been quite a bit done to make that process
>> better, so what version of Solr are you using? 4.4 is being voted on
>> right now, so you might want to consider upgrading.
>>
>> There was, for instance, a situation where it would take 3 minutes for
>> machines to start up. How impatient were you?
>>
>> Also, what are your hard commit parameters? All of the documents
>> you're indexing will be in the transaction log between hard commits,
>> and when a node comes up the leader will replay everything in the tlog
>> to the new node, which might be part of why it took so long for
>> the new node to come back up. At the very least the new node you were
>> bringing back online will need to do a full index replication (old
>> style) to get caught up.
>>
>> Best
>> Erick
>>
>> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com>
>> wrote:
>> > While indexing some documents to a SolrCloud cluster (10 machines, 5
>> > shards and 2 replicas, so one replica on each machine) one of the
>> > replicas stopped receiving documents, while the other replica of the
>> > shard continued to grow.
>> >
>> > That was overnight so I was unable to track exactly what happened (I'm
>> > going off our Graphite graphs here). This morning when I was able to look
>> > at the cluster both replicas of that shard were marked as down (with one
>> > marked as leader). I attempted to restart the non-leader node but it
>> > took a long time to restart so I killed it and restarted the old
>> > leader, which also took a long time. I killed that one (I'm impatient)
>> > and left the non-leader node to restart, not realising it was missing
>> > approximately 700k documents that the old leader had. Eventually it
>> > restarted and became leader. I restarted the old leader and it dropped
>> > the number of documents it had to match the previous non-leader.
>> >
>> > Is this expected behaviour when a replica with fewer documents is
>> > started before the other and elected leader? Should I have been paying
>> > more attention to the number of documents on the server before
>> > restarting nodes?
>> >
>> > I am still in the process of tuning the caches and warming for these
>> > servers but we are putting some load through the cluster so it is
>> > possible that the nodes are having to work quite hard when a new
>> > version of the core is made available. Is this likely to explain why I
>> > occasionally see nodes dropping out? Unfortunately in restarting the
>> > nodes I lost the GC logs to see whether that was likely to be the
>> > culprit. Is this the sort of situation where you raise the ZooKeeper
>> > timeout a bit? Currently the timeout for all nodes is 15 seconds.
>> >
>> > Are there any known issues which might explain what's happening? I'm just
>> > getting started with SolrCloud after using standard master/slave
>> > replication for an index which has got too big for one machine over the
>> > last few months.
>> >
>> > Also, is there any particular information that would be helpful to
>> > gather for diagnosing these issues if this should happen again?
>>
