Just an update on this issue: we tried increasing the ZooKeeper client timeout to 30000ms in solr.xml (I think the default is 15000ms), and we haven't seen the issue again in our tests. <cores ......... zkClientTimeout="30000" >
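For reference, a minimal sketch of what that looks like in a Solr 4.x-style solr.xml - only zkClientTimeout is the change; the other attributes below are just placeholders standing in for whatever your existing config already has:

    <!-- solr.xml sketch: only zkClientTimeout changed; remaining attributes are placeholders -->
    <cores adminPath="/admin/cores" host="${host:}" hostPort="8080"
           defaultCoreName="core1" zkClientTimeout="30000">
      <core name="core1" instanceDir="core1" />
    </cores>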
Thanks, Sudhakar.

On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

> We saw this error again today during our load test - basically, whenever the
> session expires on the leader node, we see the error. After this happens, the
> leader (001) goes into 'recovery' mode and all index updates fail with a
> "503 - Service Unavailable" error. After some time (once recovery succeeds),
> the roles are swapped, i.e. 001 acts as the replica and 003 as the leader.
>
> Btw, do you know why the connection to ZooKeeper [solr->zk] is getting
> interrupted in the middle?
> Is it because of the load (number of updates) we are putting on the cluster?
>
> Here is the exception stack trace:
>
> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leader
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
>         at java.lang.Thread.run(Unknown Source)
>
> Thx, Sudhakar.
>
>
> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>
>> Erick:
>> Not seeing any page caching related issues...
>>
>> Mark:
>> 1. Would this "waiting" on 003 (the replica) cause any inconsistencies in the
>> ZooKeeper cluster state? I was also looking at the leader (001) logs at that
>> time and saw errors like "SEVERE: ClusterState says we are the leader, but
>> locally we don't think so".
>> 2. Also, all of the servers in the cluster went down while the index updates
>> were running in parallel with this issue. Do you see this as related to the
>> session expiry on 001?
>>
>> Here are the logs on 001
>> =========================
>>
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /overseer_elect/leader
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>> INFO: According to ZK I (id=232887758696546307-<001>:8080_solr-n_0000000005) am
>> no longer a leader.
>>
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor run
>> WARNING: Overseer cannot talk to ZK
>>
>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem making a request to the
>> leader:org.apache.solr.common.SolrException: I was asked to wait on state
>> down for <001>:8080_solr but I still do not see the request state. I see
>> state: active live:true
>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem making a request to the
>> leader:org.apache.solr.common.SolrException: I was asked to wait on state
>> down for <001>:8080_solr but I still do not see the request state. I see
>> state: active live:true
>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> ....
>> ....
>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>> ....
>> ....
>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
>> finding the leader in zk
>>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>> SEVERE: Error getting leader from zk
>> org.apache.solr.common.SolrException: There is conflicting information
>> about the leader of shard: shard1 our state says:
>> http://<001>:8080/solr/core1/ but zookeeper says: http://<003>:8080/solr/core1/
>>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>>         at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>> INFO: Running the leader process.
>> ....
>> ....
>>
>> Thanks for your inputs.
>> Sudhakar.
>>
>>
>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Yes - it means that 001 went down (or, more likely, had its connection to
>>> ZooKeeper interrupted - that's what I mean about a session timeout: if the
>>> solr->zk link is broken for longer than the session timeout, that will
>>> trigger a leader election, and when the connection is reestablished, the
>>> node will have to recover). That waiting should stop as soon as 001 came
>>> back up or reconnected to ZooKeeper.
>>>
>>> In fact, this waiting should not happen in this case - it should only happen
>>> on cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>>> soon!):
>>>
>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code path
>>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>>
>>> - Mark
>>>
>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>
>>> > Yep, after restarting, the cluster came back to a normal state. We will run
>>> > a couple more tests and see if we can reproduce this issue.
>>> >
>>> > Btw, I am attaching the server logs from before that 'INFO: Waiting until
>>> > we see more replicas' message. From the logs, we can see that the leader
>>> > election process started on 003, which was initially the replica for 001.
>>> > Does that mean leader 001 went down at that time?
>>> >
>>> > logs on 003:
>>> > ========
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> > INFO: Running the leader process.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> > INFO: Checking if I should try and be the leader.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> > INFO: My last published State was Active, it's okay to be the leader.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> > INFO: I may be the new leader - try and sync
>>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>> > WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
>>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>> > INFO: Sync replicas to http://<003>:8080/solr/core1/
>>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
>>> > INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>> > INFO: Updating live nodes -> this message is on 002
>>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>> > WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed
>>> > org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
>>> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> >         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>> >         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>> >         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>> >         at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>> >         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>> >         at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>> >         at java.lang.Thread.run(Unknown Source)
>>> > Caused by: java.net.SocketTimeoutException: Read timed out
>>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>>> >         at java.net.SocketInputStream.read(Unknown Source)
>>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>>> > INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
>>> > 12:11:46 PM org.apache.solr.common.SolrException log
>>> > SEVERE: Sync Failed
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
>>> > INFO: There is a better leader candidate than us - going back into recovery
>>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>> > INFO: Running recovery - first canceling any ongoing recovery
>>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>> > INFO: Starting recovery process. core=core1 recoveringAfterStartup=false
>>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>> > INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
>>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>>> > INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> > INFO: Running the leader process.
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> > INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> > INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
>>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> > INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
>>> > ....
>>> > ....
>>> >
>>> > Thanks for your help.
>>> > Sudhakar.
>>> >
>>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> > The waiting logging had to happen on restart unless it's some kind of bug.
>>> >
>>> > Beyond that, something is off, but I have no clue why - it seems your
>>> > clusterstate.json is not up to date at all.
>>> >
>>> > Have you tried restarting the cluster then? Does that help at all?
>>> >
>>> > Do you see any exceptions around zookeeper session timeouts?
>>> >
>>> > - Mark
>>> >
>>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>> >
>>> > > Hey Mark,
>>> > >
>>> > > Yes, I am able to access all of the nodes under each shard from the
>>> > > SolrCloud admin UI.
>>> > >
>>> > > - It kind of looks like the urls solrcloud is using are not accessible.
>>> > >   When you go to the admin page and the cloud tab, can you access the
>>> > >   urls it shows for each shard? That is, if you click one of the links or
>>> > >   copy and paste the address into a web browser, does it work?
>>> > >
>>> > > Actually, I got these errors while my document upload task/job was running,
>>> > > not during the cluster restart. Also, the job ran fine for the first hour
>>> > > and started throwing these errors after indexing some docs.
>>> > >
>>> > > Thx, Sudhakar.
>>> > >
>>> > >
>>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> > >
>>> > >> It kind of looks like the urls solrcloud is using are not accessible. When
>>> > >> you go to the admin page and the cloud tab, can you access the urls it
>>> > >> shows for each shard? That is, if you click one of the links or copy and
>>> > >> paste the address into a web browser, does it work?
>>> > >>
>>> > >> You may have to explicitly set the host= in solr.xml if it's not auto
>>> > >> detecting the right one. Make sure the ports look right too.
>>> > >>
>>> > >>> waitForReplicasToComeUp
>>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>>> > >>> timeoutin=179999
>>> > >>
>>> > >> That happens when you stop the cluster and try to start it again - before
>>> > >> a leader is chosen, it will wait for all known replicas for a shard to come
>>> > >> up so that everyone can sync up and have a chance to be the best leader. So
>>> > >> at this point it was only finding one of 2 known replicas and waiting for
>>> > >> the second to come up. After a couple of minutes (configurable) it will just
>>> > >> continue anyway without the missing replica (if it doesn't show up).
>>> > >>
>>> > >> - Mark
>>> > >>
>>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>> > >>
>>> > >>> Hi,
>>> > >>> We are uploading Solr documents to the index in batches using 30 threads,
>>> > >>> with a ThreadPoolExecutor and a LinkedBlockingQueue whose max limit is set
>>> > >>> to 10000.
>>> > >>> In the code, we use HttpSolrServer and its add(inputDoc) method to add
>>> > >>> docs.
>>> > >>> And we have the following commit settings in solrconfig:
>>> > >>>
>>> > >>> <autoCommit>
>>> > >>>   <maxTime>300000</maxTime>
>>> > >>>   <maxDocs>10000</maxDocs>
>>> > >>>   <openSearcher>false</openSearcher>
>>> > >>> </autoCommit>
>>> > >>>
>>> > >>> <autoSoftCommit>
>>> > >>>   <maxTime>1000</maxTime>
>>> > >>> </autoSoftCommit>
>>> > >>>
>>> > >>> Cluster Details:
>>> > >>> ----------------------------
>>> > >>> solr version - 4.0
>>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>> > >>> numShards=2
>>> > >>> 001, 002, 003 are the solr nodes and all three are behind the
>>> > >>> loadbalancer <vip>
>>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
>>> > >>>
>>> > >>>
>>> > >>> Logs: getting the errors in the below sequence after uploading some docs:
>>> > >>> -----------------------------------------------------------------------------------------------------------
>>> > >>> 003
>>> > >>> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> > >>> waitForReplicasToComeUp
>>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>>> > >>> timeoutin=179999
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:12:59 PM
>>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor doDefensiveChecks
>>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't think so
>>> > >>>
>>> > >>> 003
>>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
>>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error while trying to recover.
>>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the leader
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException:
>>> > >>> IOException occured when talking to server at <vip>/solr/core1
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> > >>>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>> > >>>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>> > >>>         ... 5 lines omitted ...
>>> > >>>         at java.lang.Thread.run(Unknown Source)
>>> > >>> Caused by: java.net.SocketException: Connection reset
>>> > >>>
>>> > >>>
>>> > >>> After some time, all three servers go down.
>>> > >>>
>>> > >>> We'd appreciate it if someone could let us know what we are missing.
>>> > >>>
>>> > >>> Thx, Sudhakar.
>>> > >>
>>> > >>
>>> >
>>> > <logs_error.txt>
>>>
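For completeness, here is a rough, self-contained sketch of the kind of upload job described at the start of this thread (30 worker threads over a bounded LinkedBlockingQueue, adding documents through HttpSolrServer and leaving commits to autoCommit/autoSoftCommit). This is only illustrative, assuming SolrJ 4.x; the URL, field names, document count, and error handling are placeholders, not the actual job:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class UploadJobSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL standing in for the load-balancer VIP in front of 001/002/003.
            final HttpSolrServer server = new HttpSolrServer("http://vip:8080/solr/core1");

            // 30 worker threads fed from a bounded queue of 10000, as in the setup above.
            ExecutorService pool = new ThreadPoolExecutor(30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 100000; i++) {
                final String id = Integer.toString(i);
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", id);              // field names are illustrative only
                            doc.addField("title_t", "doc " + id);
                            server.add(doc);                     // commits left to autoCommit/autoSoftCommit
                        } catch (Exception e) {
                            // A real job needs retry/backoff here; the 503s discussed in this
                            // thread surface as exceptions thrown by add().
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            server.shutdown();
        }
    }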