You should increase your ZK timeout; this may be the issue in your case. You
may also want to try the G1GC collector to keep stop-the-world pauses under the ZK timeout.
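
For what it's worth, something along these lines is what I mean (a sketch only;
the exact attribute names depend on your solr.xml format and start scripts, so
treat the values as illustrative rather than definitive):

    # JVM options for the Solr start command: G1GC plus GC logging
    -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc -Xloggc:logs/gc.log

    <!-- legacy-style solr.xml: raise zkClientTimeout (in ms) above your worst
         observed GC pause, e.g. from 15000 to 30000 -->
    <cores adminPath="/admin/cores" zkClientTimeout="30000">
      <core name="collection1" instanceDir="collection1"/>
    </cores>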
-----Original message-----
> From:Neil Prosser <neil.pros...@gmail.com>
> Sent: Monday 22nd July 2013 14:38
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
>
> No need to apologise. It's always good to have things like that reiterated
> in case I've misunderstood along the way.
>
> I have a feeling that it's related to garbage collection. I assume that if
> the JVM heads into a stop-the-world GC, Solr can't let ZooKeeper know it's
> still alive and so gets marked as down. I've just taken a look at the GC
> logs and can see a couple of full collections which took longer than my ZK
> timeout of 15s. I'm still in the process of tuning the cache sizes and
> have probably got them wrong (I'm coming from a Solr instance which runs on
> a 48G heap with ~40m documents and splitting it across five shards with 8G
> heaps). I thought I was being conservative with the cache sizes but I should
> probably drop them right down and start again. The entire index is cached
> by Linux so I should just need caches to help with things which eat CPU at
> request time.
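>
> By "drop them right down" I mean something along these lines in
> solrconfig.xml (the sizes here are purely illustrative starting points,
> not what we actually run):
>
>   <filterCache class="solr.FastLRUCache" size="512"
>                initialSize="512" autowarmCount="128"/>
>   <queryResultCache class="solr.LRUCache" size="512"
>                     initialSize="512" autowarmCount="32"/>
>   <documentCache class="solr.LRUCache" size="1024"
>                  initialSize="1024" autowarmCount="0"/>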
>
> The indexing level is unusual because normally we wouldn't be indexing
> everything sequentially, just making delta updates to the index as things
> are changed in our MoR. However, it's handy to know how it reacts under the
> most extreme load we could give it.
>
> In the case that I set my hard commit time to 15-30 seconds with
> openSearcher set to false, how do I control when I actually do invalidate
> the caches and open a new searcher? Is this something that Solr can do
> automatically, or will I need some sort of coordinator process to perform a
> 'proper' commit from outside Solr?
>
> In our case the process of opening a new searcher is definitely a hefty
> operation. We have a large number of boosts and filters which are used for
> just about every query made against the index, so we currently warm them,
> which can take upwards of a minute on our giant core.
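>
> For context, the warming is done with a newSearcher listener along these
> lines (the query and field names below are placeholders rather than our
> real boosts and filters):
>
>   <listener event="newSearcher" class="solr.QuerySenderListener">
>     <arr name="queries">
>       <!-- placeholder warming query standing in for our heavy boosts/filters -->
>       <lst>
>         <str name="q">*:*</str>
>         <str name="fq">in_stock:true</str>
>         <str name="sort">popularity desc</str>
>       </lst>
>     </arr>
>   </listener>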
>
> Thanks for your help.
>
>
> On 22 July 2013 13:00, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Wow, you really shouldn't be having nodes go up and down so
> > frequently; that's a big red flag. That said, SolrCloud should be
> > pretty robust so this is something to pursue...
> >
> > But even a 5 minute hard commit can lead to a hefty transaction
> > log under load, so you may want to reduce it substantially depending
> > on how fast you are sending docs to the index. I'm talking
> > 15-30 seconds here. It's critical that openSearcher be set to false
> > or you'll invalidate your caches that often. All a hard commit
> > with openSearcher set to false does is close off the current segment
> > and open a new one. It does NOT open/warm new searchers etc.
> >
> > The soft commits control visibility, so that's how you control
> > whether you can search the docs or not. Pardon me if I'm
> > repeating stuff you already know!
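> >
> > Something like the following in solrconfig.xml is what I have in mind
> > (the numbers are just a starting point, not a recommendation for your
> > exact setup):
> >
> >   <autoCommit>
> >     <maxTime>15000</maxTime>            <!-- hard commit every 15s -->
> >     <openSearcher>false</openSearcher>  <!-- don't open/warm a new searcher -->
> >   </autoCommit>
> >   <autoSoftCommit>
> >     <maxTime>60000</maxTime>            <!-- soft commit controls visibility -->
> >   </autoSoftCommit>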
> >
> > As far as your nodes coming and going, I've seen some people have
> > good results by upping the ZooKeeper timeout limit. So I guess
> > my first question is whether the nodes are actually going out of service
> > or whether it's just a timeout issue....
> >
> > Good luck!
> > Erick
> >
> > On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser <neil.pros...@gmail.com>
> > wrote:
> > > Very true. I was impatient (I think less than three minutes impatient, so
> > > hopefully 4.4 will save me from myself) but I didn't realise it was doing
> > > something rather than just hanging. Next time I have to restart a node
> > > I'll just leave it and go get a cup of coffee or something.
> > >
> > > My configuration is set to auto hard-commit every 5 minutes. No auto
> > > soft-commit time is set.
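> > >
> > > In solrconfig.xml terms that's roughly the following (paraphrased from
> > > memory rather than pasted, so the exact values may differ slightly):
> > >
> > >   <autoCommit>
> > >     <maxTime>300000</maxTime>  <!-- 5 minutes -->
> > >   </autoCommit>
> > >   <!-- no <autoSoftCommit> section configured -->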
> > >
> > > Over the course of the weekend, while left unattended, the nodes have been
> > > going up and down (I've got to solve the issue that is causing them to
> > > come and go, but any suggestions on what is likely to be causing something
> > > like that are welcome). At one point one of the nodes stopped taking
> > > updates. After indexing properly for a few hours with that one shard not
> > > accepting updates, the replica of that shard which contains all the correct
> > > documents must have replicated from the broken node and dropped documents.
> > > Is there any protection against this in Solr or should I be focusing on
> > > getting my nodes to be more reliable? I've now got a situation where four
> > > of my five shards have leaders who are marked as down and followers who
> > > are up.
> > >
> > > I'm going to start grabbing information about the cluster state so I can
> > > track which changes are happening and in what order. I can get hold of
> > > Solr logs and garbage collection logs while these things are happening.
> > >
> > > Is this all just down to my nodes being unreliable?
> > >
> > >
> > > On 21 July 2013 13:52, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > >> Well, if I'm reading this right you had a node go out of circulation
> > >> and then bounced nodes until that node became the leader. So of course
> > >> it wouldn't have the documents (how could it?). Basically you shot
> > >> yourself in the foot.
> > >>
> > >> The underlying question here is why the machine you were re-starting took
> > >> so long to come up that you got impatient and started killing nodes.
> > >> There has been quite a bit done to make that process better, so what
> > >> version of Solr are you using? 4.4 is being voted on right now, so
> > >> you might want to consider upgrading.
> > >>
> > >> There was, for instance, a situation where it would take 3 minutes for
> > >> machines to start up. How impatient were you?
> > >>
> > >> Also, what are your hard commit parameters? All of the documents
> > >> you're indexing will be in the transaction log between hard commits,
> > >> and when a node comes up the leader will replay everything in the tlog
> > >> to the new node, which might be part of why it took so long for
> > >> the new node to come back up. At the very least the new node you were
> > >> bringing back online will need to do a full index replication (old
> > >> style) to get caught up.
> > >>
> > >> Best
> > >> Erick
> > >>
> > >> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com>
> > >> wrote:
> > >> > While indexing some documents to a SolrCloud cluster (10 machines, 5
> > >> > shards and 2 replicas, so one replica on each machine) one of the
> > >> > replicas stopped receiving documents, while the other replica of the
> > >> > shard continued to grow.
> > >> >
> > >> > That was overnight so I was unable to track exactly what happened (I'm
> > >> > going off our Graphite graphs here). This morning when I was able to look
> > >> > at the cluster, both replicas of that shard were marked as down (with one
> > >> > marked as leader). I attempted to restart the non-leader node but it took
> > >> > a long time to restart, so I killed it and restarted the old leader, which
> > >> > also took a long time. I killed that one (I'm impatient) and left the
> > >> > non-leader node to restart, not realising it was missing approximately
> > >> > 700k documents that the old leader had. Eventually it restarted and became
> > >> > leader. I restarted the old leader and it dropped the number of documents
> > >> > it had to match the previous non-leader.
> > >> >
> > >> > Is this expected behaviour when a replica with fewer documents is
> > >> > started before the other and elected leader? Should I have been paying
> > >> > more attention to the number of documents on the server before
> > >> > restarting nodes?
> > >> >
> > >> > I am still in the process of tuning the caches and warming for these
> > >> > servers, but we are putting some load through the cluster so it is
> > >> > possible that the nodes are having to work quite hard when a new version
> > >> > of the core is made available. Is this likely to explain why I
> > >> > occasionally see nodes dropping out? Unfortunately in restarting the
> > >> > nodes I lost the GC logs, so I can't see whether that was likely to be
> > >> > the culprit. Is this the sort of situation where you raise the ZooKeeper
> > >> > timeout a bit? Currently the timeout for all nodes is 15 seconds.
> > >> >
> > >> > Are there any known issues which might explain what's happening? I'm
> > >> > just getting started with SolrCloud after using standard master/slave
> > >> > replication for an index which has got too big for one machine over the
> > >> > last few months.
> > >> >
> > >> > Also, is there any particular information that would be helpful for
> > >> > diagnosing these issues if they happen again?
> > >>
> >
>