It is possible: https://issues.apache.org/jira/browse/SOLR-4260 I rarely see it and I cannot reliably reproduce it, but it just sometimes happens. Nodes will not bring each other back in sync.
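If it happens again, one thing that may be worth trying before restarting the whole node is asking the out-of-sync replica to recover explicitly through the core admin API. The hostname and core name below are just placeholders for whatever your cores are called:

  curl 'http://localhost:8983/solr/admin/cores?action=REQUESTRECOVERY&core=collection1_shard1_replica2'

No guarantee it helps, but it is cheaper than bouncing the JVM.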
-----Original message-----
> From: Neil Prosser <neil.pros...@gmail.com>
> Sent: Monday 22nd July 2013 14:41
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
>
> Sorry, I should also mention that these leader nodes which are marked as down can actually still be queried locally with distrib=false with no problems. Is it possible that they've somehow got themselves out of sync?
>
> On 22 July 2013 13:37, Neil Prosser <neil.pros...@gmail.com> wrote:
>
> > No need to apologise. It's always good to have things like that reiterated in case I've misunderstood along the way.
> >
> > I have a feeling that it's related to garbage collection. I assume that if the JVM heads into a stop-the-world GC, Solr can't let ZooKeeper know it's still alive and so gets marked as down. I've just taken a look at the GC logs and can see a couple of full collections which took longer than my ZK timeout of 15s. I'm still in the process of tuning the cache sizes and have probably got it wrong (I'm coming from a Solr instance which runs on a 48G heap with ~40m documents and bringing it into five shards with 8G heaps). I thought I was being conservative with the cache sizes but I should probably drop them right down and start again. The entire index is cached by Linux so I should just need caches to help with things which eat CPU at request time.
> >
> > The indexing level is unusual because normally we wouldn't be indexing everything sequentially, just making delta updates to the index as things are changed in our MoR. However, it's handy to know how it reacts under the most extreme load we could give it.
> >
> > In the case that I set my hard commit time to 15-30 seconds with openSearcher set to false, how do I control when I actually do invalidate the caches and open a new searcher? Is this something that Solr can do automatically, or will I need some sort of coordinator process to perform a 'proper' commit from outside Solr?
> >
> > In our case the process of opening a new searcher is definitely a hefty operation. We have a large number of boosts and filters which are used for just about every query that is made against the index, so we currently have them warmed, which can take upwards of a minute on our giant core.
> >
> > Thanks for your help.
> >
> > On 22 July 2013 13:00, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Wow, you really shouldn't be having nodes go up and down so frequently; that's a big red flag. That said, SolrCloud should be pretty robust, so this is something to pursue...
> >>
> >> But even a 5 minute hard commit can lead to a hefty transaction log under load, so you may want to reduce it substantially depending on how fast you are sending docs to the index. I'm talking 15-30 seconds here. It's critical that openSearcher be set to false or you'll invalidate your caches that often. All a hard commit with openSearcher set to false does is close off the current segment and open a new one. It does NOT open/warm new searchers etc.
> >>
> >> The soft commits control visibility, so that's how you control whether you can search the docs or not. Pardon me if I'm repeating stuff you already know!
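> >>
> >> Just to make that concrete (the numbers below are only a sketch; tune them for your own indexing rate), the relevant bit of solrconfig.xml would look something like this:
> >>
> >>   <updateHandler class="solr.DirectUpdateHandler2">
> >>     <!-- hard commit: flush the tlog and close off the current segment,
> >>          but do NOT open a new searcher or invalidate caches -->
> >>     <autoCommit>
> >>       <maxTime>15000</maxTime>
> >>       <openSearcher>false</openSearcher>
> >>     </autoCommit>
> >>     <!-- soft commit: this is what controls when new docs become searchable -->
> >>     <autoSoftCommit>
> >>       <maxTime>60000</maxTime>
> >>     </autoSoftCommit>
> >>   </updateHandler>
> >>
> >> The soft commit (or an explicit commit with openSearcher=true issued from outside) is what opens the new searcher, so that's the knob that answers your question about when the caches get thrown away and re-warmed.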
> >>
> >> As far as your nodes coming and going, I've seen some people have good results by upping the ZooKeeper timeout limit. So I guess my first question is whether the nodes are actually going out of service or whether it's just a timeout issue...
> >>
> >> Good luck!
> >> Erick
> >>
> >> On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> >> > Very true. I was impatient (I think less than three minutes impatient, so hopefully 4.4 will save me from myself) but I didn't realise it was doing something rather than just hanging. Next time I have to restart a node I'll just leave and go get a cup of coffee or something.
> >> >
> >> > My configuration is set to auto hard-commit every 5 minutes. No auto soft-commit time is set.
> >> >
> >> > Over the course of the weekend, while left unattended, the nodes have been going up and down (I've got to solve the issue that is causing them to come and go, but any suggestions on what is likely to be causing something like that are welcome), and at one point one of the nodes stopped taking updates. After indexing properly for a few hours with that one shard not accepting updates, the replica of that shard which contains all the correct documents must have replicated from the broken node and dropped documents. Is there any protection against this in Solr or should I be focusing on getting my nodes to be more reliable? I've now got a situation where four of my five shards have leaders who are marked as down and followers who are up.
> >> >
> >> > I'm going to start grabbing information about the cluster state so I can track which changes are happening and in what order. I can get hold of Solr logs and garbage collection logs while these things are happening.
> >> >
> >> > Is this all just down to my nodes being unreliable?
> >> >
> >> > On 21 July 2013 13:52, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> Well, if I'm reading this right, you had a node go out of circulation and then bounced nodes until that node became the leader. So of course it wouldn't have the documents (how could it?). Basically you shot yourself in the foot.
> >> >>
> >> >> Underlying here is why it took the machine you were re-starting so long to come up that you got impatient and started killing nodes. There has been quite a bit done to make that process better, so what version of Solr are you using? 4.4 is being voted on right now, so you might want to consider upgrading.
> >> >>
> >> >> There was, for instance, a situation where it would take 3 minutes for machines to start up. How impatient were you?
> >> >>
> >> >> Also, what are your hard commit parameters? All of the documents you're indexing will be in the transaction log between hard commits, and when a node comes up the leader will replay everything in the tlog to the new node, which might be a source of why it took so long for the new node to come back up. At the very least the new node you were bringing back online will need to do a full index replication (old style) to get caught up.
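> >> >>
> >> >> One crude sanity check (the path below is just an example; it depends on where your cores keep their data directories) is to look at how big those transaction logs actually are before you bounce a node:
> >> >>
> >> >>   du -sh /path/to/solr/home/*/data/tlog
> >> >>
> >> >> If they're into the hundreds of megabytes, replaying them at startup will take a while, which is another argument for the shorter hard commit interval.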
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> >> >> > While indexing some documents to a SolrCloud cluster (10 machines, 5 shards and 2 replicas, so one replica on each machine) one of the replicas stopped receiving documents, while the other replica of the shard continued to grow.
> >> >> >
> >> >> > That was overnight so I was unable to track exactly what happened (I'm going off our Graphite graphs here). This morning when I was able to look at the cluster both replicas of that shard were marked as down (with one marked as leader). I attempted to restart the non-leader node but it took a long time to restart, so I killed it and restarted the old leader, which also took a long time. I killed that one (I'm impatient) and left the non-leader node to restart, not realising it was missing approximately 700k documents that the old leader had. Eventually it restarted and became leader. I restarted the old leader and it dropped the number of documents it had to match the previous non-leader.
> >> >> >
> >> >> > Is this expected behaviour when a replica with fewer documents is started before the other and elected leader? Should I have been paying more attention to the number of documents on the server before restarting nodes?
> >> >> >
> >> >> > I am still in the process of tuning the caches and warming for these servers, but we are putting some load through the cluster so it is possible that the nodes are having to work quite hard when a new version of the core is made available. Is this likely to explain why I occasionally see nodes dropping out? Unfortunately in restarting the nodes I lost the GC logs to see whether that was likely to be the culprit. Is this the sort of situation where you raise the ZooKeeper timeout a bit? Currently the timeout for all nodes is 15 seconds.
> >> >> >
> >> >> > Are there any known issues which might explain what's happening? I'm just getting started with SolrCloud after using standard master/slave replication for an index which has got too big for one machine over the last few months.
> >> >> >
> >> >> > Also, is there any particular information that would be helpful to help with these issues if it should happen again?
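> >> >> >
> >> >> > For what it's worth, next time this happens I'll make sure the GC logs survive a restart by starting the JVMs with something along these lines (the log path is just an example):
> >> >> >
> >> >> >   -verbose:gc -Xloggc:/var/log/solr/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
> >> >> >
> >> >> > That should at least show whether a stop-the-world collection lines up with a node being marked down.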