OK, so clearing the transaction log allowed things to go again. I am going to clear the index and try to reproduce the problem on 4.2.0, and then I'll try on 4.2.1.
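
In case it helps anyone trying to reproduce this, the snippet below is roughly how I'm clearing the tlog for a core before restarting the node. It is only a minimal sketch, not anything official: the path and core name are assumptions based on the stock example layout (data/tlog under the core's instance dir), so adjust them for your install, and only run it while the node is stopped.

import java.io.File;

// Deletes the transaction log directory for one core so the node can be
// restarted without replaying a suspect tlog. The path is an assumption
// based on the stock example layout; point it at your own core's data dir
// and only run this while the Solr instance is stopped.
public class ClearTlog {

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        if (!f.delete()) {
            System.err.println("could not delete " + f);
        }
    }

    public static void main(String[] args) {
        File tlog = new File("/opt/solr/example/solr/dsc-shard5-core2/data/tlog");
        if (tlog.isDirectory()) {
            deleteRecursively(tlog);
            System.out.println("cleared " + tlog);
        } else {
            System.out.println("no tlog directory at " + tlog);
        }
    }
}

A plain rm -rf on that directory does the same thing, of course; the point is just which directory has to go away before the node comes back up.
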
On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
> No, not that I know of, which is why I say we need to get to the bottom of
> it.
>
> - Mark
>
> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
> > Mark
> > Is there a particular jira issue that you think may address this? I read
> > through it quickly but didn't see one that jumped out
> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >
> >> I brought the bad one down and back up and it did nothing. I can clear
> >> the index and try 4.2.1. I will save off the logs and see if there is
> >> anything else odd
> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
> >>
> >>> It would appear it's a bug given what you have said.
> >>>
> >>> Any other exceptions would be useful. Might be best to start tracking in
> >>> a JIRA issue as well.
> >>>
> >>> To fix, I'd bring the behind node down and back again.
> >>>
> >>> Unfortunately, I'm pressed for time, but we really need to get to the
> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
> >>> to mirrors now).
> >>>
> >>> - Mark
> >>>
> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>
> >>>> Sorry I didn't ask the obvious question. Is there anything else that I
> >>>> should be looking for here, and is this a bug? I'd be happy to troll
> >>>> through the logs further if more information is needed, just let me know.
> >>>>
> >>>> Also, what is the most appropriate mechanism to fix this? Is it required to
> >>>> kill the index that is out of sync and let solr resync things?
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>
> >>>>> sorry for spamming here....
> >>>>>
> >>>>> shard5-core2 is the instance we're having issues with...
> >>>>>
> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
> >>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> here is another one that looks interesting
> >>>>>>
> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
> >>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Looking at the master it looks like at some point there were shards that
> >>>>>>> went down. I am seeing things like what is below.
> >>>>>>>
> >>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>> INFO: Updating live nodes... (9)
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>> INFO: Running the leader process.
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> I don't think the versions you are thinking of apply here. Peersync
> >>>>>>>> does not look at that - it looks at version numbers for updates in the
> >>>>>>>> transaction log - it compares the last 100 of them on leader and replica.
> >>>>>>>> What it's saying is that the replica seems to have versions that the leader
> >>>>>>>> does not. Have you scanned the logs for any interesting exceptions?
> >>>>>>>>
> >>>>>>>> Did the leader change during the heavy indexing? Did any zk session
> >>>>>>>> timeouts occur?
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
> >>>>>>>>> strange issue while testing today. Specifically the replica has a higher
> >>>>>>>>> version than the master which is causing the index to not replicate.
> >>>>>>>>> Because of this the replica has fewer documents than the master. What
> >>>>>>>>> could cause this and how can I resolve it short of taking down the index
> >>>>>>>>> and scping the right version in?
> >>>>>>>>>
> >>>>>>>>> MASTER:
> >>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>> Num Docs: 164880
> >>>>>>>>> Max Doc: 164880
> >>>>>>>>> Deleted Docs: 0
> >>>>>>>>> Version: 2387
> >>>>>>>>> Segment Count: 23
> >>>>>>>>>
> >>>>>>>>> REPLICA:
> >>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>> Num Docs: 164773
> >>>>>>>>> Max Doc: 164773
> >>>>>>>>> Deleted Docs: 0
> >>>>>>>>> Version: 3001
> >>>>>>>>> Segment Count: 30
> >>>>>>>>>
> >>>>>>>>> in the replica's log it says this:
> >>>>>>>>>
> >>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> which again seems to point that it thinks it has a newer version of the
> >>>>>>>>> index so it aborts. This happened while having 10 threads indexing 10,000
> >>>>>>>>> items writing to a 6 shard (1 replica each) cluster. Any thoughts on this
> >>>>>>>>> or what I should look for would be appreciated.
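
P.S. For anyone skimming this later, here is my mental model of the check Mark described above (comparing the last ~100 update versions from each side's transaction log). It is a simplified sketch with made-up class and method names, not Solr's actual PeerSync code:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy illustration of the PeerSync idea described above: each node keeps the
// versions of its most recent ~100 updates in the transaction log, and the
// replica compares its window against the versions the leader reports.
// This is NOT Solr's PeerSync implementation; names and logic are simplified.
public class VersionWindowSketch {

    enum Outcome { ALREADY_CURRENT, FETCH_MISSING_UPDATES, TOO_FAR_BEHIND }

    static Outcome compare(List<Long> ourRecentVersions, List<Long> leaderRecentVersions) {
        if (ourRecentVersions.isEmpty() || leaderRecentVersions.isEmpty()) {
            // No window to compare; fall back to full replication.
            return Outcome.TOO_FAR_BEHIND;
        }
        long ourLow = Collections.min(ourRecentVersions);    // oldest version in our window
        long ourHigh = Collections.max(ourRecentVersions);   // newest version we have
        long otherLow = Collections.min(leaderRecentVersions);
        long otherHigh = Collections.max(leaderRecentVersions);

        if (ourLow > otherHigh) {
            // Everything the leader reports is older than our entire window, so the
            // replica concludes it is already up to date and the sync "succeeds"
            // without fetching anything; roughly what the replica log above means
            // by "Our versions are newer".
            return Outcome.ALREADY_CURRENT;
        }
        if (ourHigh < otherLow) {
            // No overlap at all: the tlog window is not enough to catch up,
            // so a full index replication would be needed instead.
            return Outcome.TOO_FAR_BEHIND;
        }
        // Windows overlap: request the individual updates we are missing.
        return Outcome.FETCH_MISSING_UPDATES;
    }

    public static void main(String[] args) {
        List<Long> ours = Arrays.asList(105L, 104L, 103L, 102L, 101L);
        List<Long> leaders = Arrays.asList(100L, 99L, 98L);
        // Our whole window looks newer than anything the leader reports:
        System.out.println(compare(ours, leaders)); // ALREADY_CURRENT
    }
}

If that model is roughly right, it would explain the symptom above: the replica decides it is already current, the sync is reported as succeeded, and the documents it is missing are never fetched.
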