It is possible: https://issues.apache.org/jira/browse/SOLR-4260 I rarely see it and I cannot reliably reproduce it, but it just sometimes happens. Nodes will not bring each other back in sync.
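If it happens again, one thing that may be worth trying before restarting the whole node is asking the out-of-sync replica to recover explicitly through the core admin API. The hostname and core name below are just placeholders for whatever your cores are called:

  curl 'http://localhost:8983/solr/admin/cores?action=REQUESTRECOVERY&core=collection1_shard1_replica2'

No guarantee it helps, but it is cheaper than bouncing the JVM.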
-----Original message-----
> From: Neil Prosser <neil.pros...@gmail.com>
> Sent: Monday 22nd July 2013 14:41
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
>
> Sorry, I should also mention that these leader nodes which are marked as down can actually still be queried locally with distrib=false with no problems. Is it possible that they've somehow got themselves out of sync?
>
> On 22 July 2013 13:37, Neil Prosser <neil.pros...@gmail.com> wrote:
>
> > No need to apologise. It's always good to have things like that reiterated in case I've misunderstood along the way.
> >
> > I have a feeling that it's related to garbage collection. I assume that if the JVM heads into a stop-the-world GC, Solr can't let ZooKeeper know it's still alive and so gets marked as down. I've just taken a look at the GC logs and can see a couple of full collections which took longer than my ZK timeout of 15s. I'm still in the process of tuning the cache sizes and have probably got it wrong (I'm coming from a Solr instance which runs on a 48G heap with ~40m documents and bringing it into five shards with 8G heaps). I thought I was being conservative with the cache sizes but I should probably drop them right down and start again. The entire index is cached by Linux so I should just need caches to help with things which eat CPU at request time.
> >
> > The indexing level is unusual because normally we wouldn't be indexing everything sequentially, just making delta updates to the index as things are changed in our MoR. However, it's handy to know how it reacts under the most extreme load we could give it.
> >
> > In the case that I set my hard commit time to 15-30 seconds with openSearcher set to false, how do I control when I actually do invalidate the caches and open a new searcher? Is this something that Solr can do automatically, or will I need some sort of coordinator process to perform a 'proper' commit from outside Solr?
> >
> > In our case the process of opening a new searcher is definitely a hefty operation. We have a large number of boosts and filters which are used for just about every query that is made against the index, so we currently have them warmed, which can take upwards of a minute on our giant core.
> >
> > Thanks for your help.
> >
> > On 22 July 2013 13:00, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Wow, you really shouldn't be having nodes go up and down so frequently; that's a big red flag. That said, SolrCloud should be pretty robust, so this is something to pursue...
> >>
> >> But even a 5 minute hard commit can lead to a hefty transaction log under load, so you may want to reduce it substantially depending on how fast you are sending docs to the index. I'm talking 15-30 seconds here. It's critical that openSearcher be set to false or you'll invalidate your caches that often. All a hard commit with openSearcher set to false does is close off the current segment and open a new one. It does NOT open/warm new searchers etc.
> >>
> >> The soft commits control visibility, so that's how you control whether you can search the docs or not. Pardon me if I'm repeating stuff you already know!
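> >>
> >> Just to make that concrete (the numbers below are only a sketch; tune them for your own indexing rate), the relevant bit of solrconfig.xml would look something like this:
> >>
> >>   <updateHandler class="solr.DirectUpdateHandler2">
> >>     <!-- hard commit: flush the tlog and close off the current segment,
> >>          but do NOT open a new searcher or invalidate caches -->
> >>     <autoCommit>
> >>       <maxTime>15000</maxTime>
> >>       <openSearcher>false</openSearcher>
> >>     </autoCommit>
> >>     <!-- soft commit: this is what controls when new docs become searchable -->
> >>     <autoSoftCommit>
> >>       <maxTime>60000</maxTime>
> >>     </autoSoftCommit>
> >>   </updateHandler>
> >>
> >> The soft commit (or an explicit commit with openSearcher=true issued from outside) is what opens the new searcher, so that's the knob that answers your question about when the caches get thrown away and re-warmed.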
> >>
> >> As far as your nodes coming and going, I've seen some people have good results by upping the ZooKeeper timeout limit. So I guess my first question is whether the nodes are actually going out of service or whether it's just a timeout issue...
> >>
> >> Good luck!
> >> Erick
> >>
> >> On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> >> > Very true. I was impatient (I think less than three minutes impatient, so hopefully 4.4 will save me from myself) but I didn't realise it was doing something rather than just hanging. Next time I have to restart a node I'll just leave and go get a cup of coffee or something.
> >> >
> >> > My configuration is set to auto hard-commit every 5 minutes. No auto soft-commit time is set.
> >> >
> >> > Over the course of the weekend, while left unattended, the nodes have been going up and down (I've got to solve the issue that is causing them to come and go, but any suggestions on what is likely to be causing something like that are welcome), and at one point one of the nodes stopped taking updates. After indexing properly for a few hours with that one shard not accepting updates, the replica of that shard which contains all the correct documents must have replicated from the broken node and dropped documents. Is there any protection against this in Solr or should I be focusing on getting my nodes to be more reliable? I've now got a situation where four of my five shards have leaders who are marked as down and followers who are up.
> >> >
> >> > I'm going to start grabbing information about the cluster state so I can track which changes are happening and in what order. I can get hold of Solr logs and garbage collection logs while these things are happening.
> >> >
> >> > Is this all just down to my nodes being unreliable?
> >> >
> >> > On 21 July 2013 13:52, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> Well, if I'm reading this right, you had a node go out of circulation and then bounced nodes until that node became the leader. So of course it wouldn't have the documents (how could it?). Basically you shot yourself in the foot.
> >> >>
> >> >> Underlying here is why it took the machine you were re-starting so long to come up that you got impatient and started killing nodes. There has been quite a bit done to make that process better, so what version of Solr are you using? 4.4 is being voted on right now, so you might want to consider upgrading.
> >> >>
> >> >> There was, for instance, a situation where it would take 3 minutes for machines to start up. How impatient were you?
> >> >>
> >> >> Also, what are your hard commit parameters? All of the documents you're indexing will be in the transaction log between hard commits, and when a node comes up the leader will replay everything in the tlog to the new node, which might be a source of why it took so long for the new node to come back up. At the very least the new node you were bringing back online will need to do a full index replication (old style) to get caught up.
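> >> >>
> >> >> One crude sanity check (the path below is just an example; it depends on where your cores keep their data directories) is to look at how big those transaction logs actually are before you bounce a node:
> >> >>
> >> >>   du -sh /path/to/solr/home/*/data/tlog
> >> >>
> >> >> If they're into the hundreds of megabytes, replaying them at startup will take a while, which is another argument for the shorter hard commit interval.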
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> >> >> > While indexing some documents to a SolrCloud cluster (10 machines, 5 shards and 2 replicas, so one replica on each machine) one of the replicas stopped receiving documents, while the other replica of the shard continued to grow.
> >> >> >
> >> >> > That was overnight so I was unable to track exactly what happened (I'm going off our Graphite graphs here). This morning when I was able to look at the cluster both replicas of that shard were marked as down (with one marked as leader). I attempted to restart the non-leader node but it took a long time to restart, so I killed it and restarted the old leader, which also took a long time. I killed that one (I'm impatient) and left the non-leader node to restart, not realising it was missing approximately 700k documents that the old leader had. Eventually it restarted and became leader. I restarted the old leader and it dropped the number of documents it had to match the previous non-leader.
> >> >> >
> >> >> > Is this expected behaviour when a replica with fewer documents is started before the other and elected leader? Should I have been paying more attention to the number of documents on the server before restarting nodes?
> >> >> >
> >> >> > I am still in the process of tuning the caches and warming for these servers, but we are putting some load through the cluster so it is possible that the nodes are having to work quite hard when a new version of the core is made available. Is this likely to explain why I occasionally see nodes dropping out? Unfortunately in restarting the nodes I lost the GC logs to see whether that was likely to be the culprit. Is this the sort of situation where you raise the ZooKeeper timeout a bit? Currently the timeout for all nodes is 15 seconds.
> >> >> >
> >> >> > Are there any known issues which might explain what's happening? I'm just getting started with SolrCloud after using standard master/slave replication for an index which has got too big for one machine over the last few months.
> >> >> >
> >> >> > Also, is there any particular information that would be helpful to help with these issues if it should happen again?
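> >> >> >
> >> >> > For what it's worth, next time this happens I'll make sure the GC logs survive a restart by starting the JVMs with something along these lines (the log path is just an example):
> >> >> >
> >> >> >   -verbose:gc -Xloggc:/var/log/solr/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
> >> >> >
> >> >> > That should at least show whether a stop-the-world collection lines up with a node being marked down.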