Thanks, Anshum. This is great to know. If any of you can share your
experience with restarting such massive clusters, that would help greatly.




On Wed, Aug 13, 2014 at 3:19 PM, Anshum Gupta <ans...@anshumgupta.net>
wrote:

> Hi Nitin,
>
> There's already an issue for breaking up clusterstate.json. Here's the
> link:
> https://issues.apache.org/jira/browse/SOLR-5473
>
> A lot of work has already been done on that one and hopefully, it
> should be in trunk soon.
>
>
> On Wed, Aug 13, 2014 at 3:13 PM, KNitin <nitin.t...@gmail.com> wrote:
> > Thanks, Mark. Yes, I keep track of the overseer and restart it at the end.
> > The only thing I observe is that as the ZooKeeper cluster state file
> > grows, this behavior gets worse. I notice the following issues:
> >
> >    1. Two nodes (different replicas of the same shard) get stuck in the
> >    recovering state without either becoming leader. I thought ZooKeeper
> >    was meant to break ties, but it doesn't help.
> >    2. If recovery fails on a replica, it gets stuck retrying for a very
> >    long time (on the order of tens of minutes) before it finally gives
> >    up or recovers.
> >    3. There have been cases where 1000 collections restart successfully
> >    but take over 2 hours (because of #2).
> >
> > The cluster state JSON file is updated continuously as the cluster
> > restarts (to update core status). Has anyone seen this become a big
> > bottleneck? Does ZooKeeper locking files for writes cause a major
> > problem while restarting Solr?
> >
> > Also, a side question: why do we need a global cluster state JSON?
> > Would it be better to break it down into a per-collection state JSON file?
> >
> > Thanks for all your help!
> > Nitin
> >
> >
> >
> >
> > On Wed, Aug 13, 2014 at 9:15 AM, Mark Miller <markrmil...@gmail.com>
> > wrote:
> >
> >> That is good testing :) We should track down what is up with that 30%.
> >> Might open a JIRA with some logs.
> >>
> >> It can help if you restart the overseer node last.
> >>
> >> There are likely some improvements around this post 4.6.
> >>
> >> --
> >> Mark Miller
> >> about.me/markrmiller
> >>
> >> On August 13, 2014 at 12:05:27 PM, KNitin (nitin.t...@gmail.com) wrote:
> >> > Thank you all! Yes, I want to disable it for testing purposes.
> >> >
> >> > The main issue is that a rolling restart of SolrCloud with 1000
> >> > collections is extremely unreliable and slow. More than 30% of the
> >> > collections fail to recover.
> >> >
> >> > What are some good guidelines to follow while restarting a massive
> >> > cluster like this?
> >> >
> >> > Are there any new improvements (post 4.6) in Solr that help restarts
> >> > be more robust?
> >> >
> >> > Thanks
> >> >
> >> > On Sunday, August 10, 2014, rulinma wrote:
> >> >
> >> > > good.
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > View this message in context:
> >> > > http://lucene.472066.n3.nabble.com/Disabling-transaction-logs-tp4151721p4152222.html
> >> > > Sent from the Solr - User mailing list archive at Nabble.com.
> >> > >
> >> >
> >>
> >>
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>
