Sorry, I should also mention that these leader nodes which are marked as
down can actually still be queried locally with distrib=false with no
problems. Is it possible that they've somehow got themselves out of sync?
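For reference, the kind of local query I mean is something like this (host
and collection names are placeholders):

    curl 'http://solr01:8983/solr/collection1/select?q=*:*&distrib=false'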
On 22 July 2013 13:37, Neil Prosser <neil.pros...@gmail.com> wrote:
> No need to apologise. It's always good to have things like that reiterated
> in case I've misunderstood along the way.
>
> I have a feeling that it's related to garbage collection. I assume that if
> the JVM heads into a stop-the-world GC, Solr can't let ZooKeeper know it's
> still alive and so gets marked as down. I've just taken a look at the GC
> logs and can see a couple of full collections which took longer than my ZK
> timeout of 15s. I'm still in the process of tuning the cache sizes and
> have probably got it wrong (I'm coming from a Solr instance which runs on
> a 48G heap with ~40m documents and bringing it into five shards with 8G
> heaps). I thought I was being conservative with the cache sizes but I
> should probably drop them right down and start again. The entire index is
> cached by Linux so I should just need caches to help with things which eat
> CPU at request time.
>
> The indexing level is unusual because normally we wouldn't be indexing
> everything sequentially, just making delta updates to the index as things
> are changed in our MoR. However, it's handy to know how it reacts under
> the most extreme load we could give it.
>
> In the case that I set my hard commit time to 15-30 seconds with
> openSearcher set to false, how do I control when I actually do invalidate
> the caches and open a new searcher? Is this something that Solr can do
> automatically, or will I need some sort of coordinator process to perform
> a 'proper' commit from outside Solr?
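> I assume the hard-commit side would look something like this in
> solrconfig.xml (times purely illustrative), with autoSoftCommit presumably
> being the knob that controls when searchers are opened:
>
>     <autoCommit>
>       <maxTime>15000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>       <maxTime>60000</maxTime>
>     </autoSoftCommit>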
> In our case the process of opening a new searcher is definitely a hefty
> operation. We have a large number of boosts and filters which are used for
> just about every query that is made against the index, so we currently
> have them warmed, which can take upwards of a minute on our giant core.
>
> Thanks for your help.
>
>
> On 22 July 2013 13:00, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Wow, you really shouldn't be having nodes go up and down so
>> frequently; that's a big red flag. That said, SolrCloud should be
>> pretty robust, so this is something to pursue...
>>
>> But even a 5 minute hard commit can lead to a hefty transaction
>> log under load, so you may want to reduce it substantially depending
>> on how fast you are sending docs to the index. I'm talking
>> 15-30 seconds here. It's critical that openSearcher be set to false
>> or you'll invalidate your caches that often. All a hard commit
>> with openSearcher set to false does is close off the current segment
>> and open a new one. It does NOT open/warm new searchers etc.
>>
>> The soft commits control visibility, so that's how you control
>> whether you can search the docs or not. Pardon me if I'm
>> repeating stuff you already know!
>>
>> As far as your nodes coming and going, I've seen some people have
>> good results by upping the ZooKeeper timeout limit.
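>> With the stock solr.xml that means something like (the value is
>> illustrative, you'll have to experiment):
>>
>>     zkClientTimeout="${zkClientTimeout:30000}"
>>
>> on the <cores> element, or starting Solr with -DzkClientTimeout=30000
>> if your solr.xml already references that system property.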
>> So I guess my first question is whether the nodes are actually going out
>> of service or whether it's just a timeout issue....
>>
>> Good luck!
>> Erick
>>
>> On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser <neil.pros...@gmail.com>
>> wrote:
>> > Very true. I was impatient (I think less than three minutes impatient,
>> > so hopefully 4.4 will save me from myself) but I didn't realise it was
>> > doing something rather than just hanging. Next time I have to restart
>> > a node I'll just leave and go get a cup of coffee or something.
>> >
>> > My configuration is set to auto hard-commit every 5 minutes. No auto
>> > soft-commit time is set.
>> >
>> > Over the course of the weekend, while left unattended, the nodes have
>> > been going up and down (I've got to solve the issue that is causing
>> > them to come and go, but any suggestions on what is likely to be
>> > causing something like that are welcome), and at one point one of the
>> > nodes stopped taking updates. After indexing properly for a few hours
>> > with that one shard not accepting updates, the replica of that shard
>> > which contained all the correct documents must have replicated from
>> > the broken node and dropped documents. Is there any protection against
>> > this in Solr or should I be focusing on getting my nodes to be more
>> > reliable? I've now got a situation where four of my five shards have
>> > leaders which are marked as down and followers which are up.
>> >
>> > I'm going to start grabbing information about the cluster state so I
>> > can track which changes are happening and in what order. I can get
>> > hold of Solr logs and garbage collection logs while these things are
>> > happening.
>> >
>> > Is this all just down to my nodes being unreliable?
>> >
>> >
>> > On 21 July 2013 13:52, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> Well, if I'm reading this right you had a node go out of circulation
>> >> and then bounced nodes until that node became the leader. So of
>> >> course it wouldn't have the documents (how could it?). Basically you
>> >> shot yourself in the foot.
>> >>
>> >> The underlying question is why the machine you were restarting took
>> >> so long to come up that you got impatient and started killing nodes.
>> >> There has been quite a bit done to make that process better, so what
>> >> version of Solr are you using? 4.4 is being voted on right now, so
>> >> you might want to consider upgrading.
>> >>
>> >> There was, for instance, a situation where it would take 3 minutes
>> >> for machines to start up. How impatient were you?
>> >>
>> >> Also, what are your hard commit parameters? All of the documents
>> >> you're indexing will be in the transaction log between hard commits,
>> >> and when a node comes up the leader will replay everything in the
>> >> tlog to the new node, which might be a source of why it took so long
>> >> for the new node to come back up. At the very least the new node you
>> >> were bringing back online will need to do a full index replication
>> >> (old style) to get caught up.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com>
>> >> wrote:
>> >> > While indexing some documents to a SolrCloud cluster (10 machines,
>> >> > 5 shards and 2 replicas, so one replica on each machine) one of the
>> >> > replicas stopped receiving documents, while the other replica of
>> >> > the shard continued to grow.
>> >> >
>> >> > That was overnight so I was unable to track exactly what happened
>> >> > (I'm going off our Graphite graphs here). This morning when I was
>> >> > able to look at the cluster both replicas of that shard were marked
>> >> > as down (with one marked as leader). I attempted to restart the
>> >> > non-leader node but it took a long time to restart, so I killed it
>> >> > and restarted the old leader, which also took a long time. I killed
>> >> > that one (I'm impatient) and left the non-leader node to restart,
>> >> > not realising it was missing approximately 700k documents that the
>> >> > old leader had. Eventually it restarted and became leader. I
>> >> > restarted the old leader and it dropped the number of documents it
>> >> > had to match the previous non-leader.
>> >> >
>> >> > Is this expected behaviour when a replica with fewer documents is
>> >> > started before the other and elected leader? Should I have been
>> >> > paying more attention to the number of documents on each server
>> >> > before restarting nodes?
>> >> >
>> >> > I am still in the process of tuning the caches and warming for
>> >> > these servers, but we are putting some load through the cluster so
>> >> > it is possible that the nodes are having to work quite hard when a
>> >> > new version of the core is made available. Is this likely to
>> >> > explain why I occasionally see nodes dropping out? Unfortunately in
>> >> > restarting the nodes I lost the GC logs to see whether that was
>> >> > likely to be the culprit. Is this the sort of situation where you
>> >> > raise the ZooKeeper timeout a bit? Currently the timeout for all
>> >> > nodes is 15 seconds.
>> >> >
>> >> > Are there any known issues which might explain what's happening?
>> >> > I'm just getting started with SolrCloud after using standard
>> >> > master/slave replication for an index which has got too big for one
>> >> > machine over the last few months.
>> >> >
>> >> > Also, is there any particular information that would help with
>> >> > these issues if this should happen again?
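>> >> >
>> >> > For reference, the GC logging flags on these nodes are roughly the
>> >> > following (the log path is illustrative):
>> >> >
>> >> >     -verbose:gc
>> >> >     -XX:+PrintGCDetails
>> >> >     -XX:+PrintGCDateStamps
>> >> >     -XX:+PrintGCApplicationStoppedTime
>> >> >     -Xloggc:/var/log/solr/gc.log
>> >> >
>> >> > (One gotcha: -Xloggc truncates its file when the JVM restarts,
>> >> > which is how I managed to lose the logs.)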