Thanks, Anshum. This is great to know. If any of you can share your experience with restarting such massive clusters, that would help greatly.
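
For reference, here is a rough sketch of the restart ordering discussed in the quoted thread (overseer node last), plus a check for replicas that have not returned to the active state. It is only an illustration under assumptions: the helper names `restart_order` and `unrecovered_replicas`, the node names, and the sample data are hypothetical, the overseer is assumed to be known already, and the cluster state is assumed to be a parsed JSON document shaped as collection -> "shards" -> "replicas" -> "state".

```python
import json


def restart_order(nodes, overseer):
    """Return the nodes in rolling-restart order, with the overseer last."""
    if overseer not in nodes:
        raise ValueError("overseer must be one of the nodes")
    return [n for n in nodes if n != overseer] + [overseer]


def unrecovered_replicas(cluster_state):
    """List (collection, shard, replica, state) for replicas that are not active.

    Assumes the layout collection -> "shards" -> shard -> "replicas" -> replica,
    where each replica carries a "state" field.
    """
    stuck = []
    for collection, info in cluster_state.items():
        for shard, shard_info in info.get("shards", {}).items():
            for replica, props in shard_info.get("replicas", {}).items():
                if props.get("state") != "active":
                    stuck.append((collection, shard, replica, props.get("state")))
    return stuck


# Hypothetical sample data, for illustration only.
sample = json.loads("""
{
  "collection1": {
    "shards": {
      "shard1": {
        "replicas": {
          "core_node1": {"state": "active", "node_name": "host1:8983_solr"},
          "core_node2": {"state": "recovering", "node_name": "host2:8983_solr"}
        }
      }
    }
  }
}
""")

if __name__ == "__main__":
    nodes = ["host1:8983_solr", "host2:8983_solr", "host3:8983_solr"]
    print(restart_order(nodes, overseer="host2:8983_solr"))
    print(unrecovered_replicas(sample))
```

Running the check after each node comes back, and before moving on to the next one, would at least show which replicas are stuck in recovery rather than letting a whole restart pass go by.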

On Wed, Aug 13, 2014 at 3:19 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:

> Hi Nitin,
>
> There's already an issue for breaking up the clusterstate.json. Here's the
> link:
> https://issues.apache.org/jira/browse/SOLR-5473
>
> A lot of work has already been done on that one and hopefully, it
> should be in trunk soon.
>
>
> On Wed, Aug 13, 2014 at 3:13 PM, KNitin <nitin.t...@gmail.com> wrote:
>
> > Thanks, Mark. Yes, I keep track of the overseer and restart it at the end.
> > The only thing that I observe is that as the zookeeper cluster state file
> > grows, this behavior gets worse. I notice the following issues:
> >
> > 1. Two nodes (different replicas for the same shard) get stuck in
> >    recovering state without either becoming a leader. I thought zk was
> >    meant to break ties, but it doesn't help
> > 2. If the recovery fails on a replica, it gets stuck retrying for a very
> >    long time (on the order of tens of minutes) before finally giving
> >    up/recovering
> > 3. There have been cases where 1000 collections restart successfully but
> >    it takes over 2 hours (because of #2)
> >
> > The cluster state json file is continuously being updated as the cluster
> > restarts (to update core status). Has anyone seen this being a big
> > bottleneck? Does zookeeper locking files for writes cause a huge issue
> > while restarting solr?
> >
> > Also a side question: Why do we need to have a global cluster state json?
> > Is it better to break it down to a per collection state json file?
> >
> > Thanks for all your help!
> > Nitin
> >
> >
> > On Wed, Aug 13, 2014 at 9:15 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> That is good testing :) We should track down what is up with that 30%.
> >> Might open a JIRA with some logs.
> >>
> >> It can help if you restart the overseer node last.
> >>
> >> There are likely some improvements around this post 4.6.
> >>
> >> --
> >> Mark Miller
> >> about.me/markrmiller
> >>
> >> On August 13, 2014 at 12:05:27 PM, KNitin (nitin.t...@gmail.com) wrote:
> >> > Thank you all! Yes, I want to disable it for testing purposes
> >> >
> >> > The main issue is that rolling restart of solrcloud for 1000 collections
> >> > is extremely unreliable and slow. More than 30% of the collections fail
> >> > to recover.
> >> >
> >> > What are some good guidelines to follow while restarting a massive
> >> > cluster like this?
> >> >
> >> > Are there any new improvements (post 4.6) in solr that help restarts to
> >> > be more robust?
> >> >
> >> > Thanks
> >> >
> >> > On Sunday, August 10, 2014, rulinma wrote:
> >> >
> >> > > good.
> >> > >
> >> > > --
> >> > > View this message in context:
> >> > > http://lucene.472066.n3.nabble.com/Disabling-transaction-logs-tp4151721p4152222.html
> >> > > Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Anshum Gupta
> http://www.anshumgupta.net