Just an update on this issue: we tried increasing the ZooKeeper client timeout to 30000ms in solr.xml (I think the default is 15000ms), and we haven't seen the issue again in our tests. <cores ......... zkClientTimeout="30000" >
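For reference, a minimal sketch of what that looks like in a Solr 4.x-style solr.xml - only zkClientTimeout is the change; the other attributes below are just placeholders standing in for whatever your existing config already has:

    <!-- solr.xml sketch: only zkClientTimeout changed; remaining attributes are placeholders -->
    <cores adminPath="/admin/cores" host="${host:}" hostPort="8080"
           defaultCoreName="core1" zkClientTimeout="30000">
      <core name="core1" instanceDir="core1" />
    </cores>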
Thanks, Sudhakar.

On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

> We saw this error again today during our load test - basically, whenever the
> session expires on the leader node, we see the error. After this happens, the
> leader (001) goes into 'recovery' mode and all index updates fail with a
> "503 - Service Unavailable" error. After some time (once recovery succeeds),
> the roles are swapped, i.e. 001 acts as the replica and 003 as the leader.
>
> Btw, do you know why the connection to ZooKeeper [solr->zk] is getting
> interrupted in the middle?
> Is it because of the load (number of updates) we are putting on the cluster?
>
> Here is the exception stack trace:
>
> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leader
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
>         at java.lang.Thread.run(Unknown Source)
>
> Thx, Sudhakar.
>
>
> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>
>> Erick:
>> Not seeing any page caching related issues...
>>
>> Mark:
>> 1. Would this "waiting" on 003 (the replica) cause any inconsistencies in the
>> ZooKeeper cluster state? I was also looking at the leader (001) logs at that
>> time and saw errors like "SEVERE: ClusterState says we are the leader, but
>> locally we don't think so".
>> 2. Also, all of the servers in the cluster went down while the index updates
>> were running in parallel with this issue. Do you see this as related to the
>> session expiry on 001?
>>
>> Here are the logs on 001
>> =========================
>>
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /overseer_elect/leader
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>> INFO: According to ZK I (id=232887758696546307-<001>:8080_solr-n_0000000005) am
>> no longer a leader.
>>
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor run
>> WARNING: Overseer cannot talk to ZK
>>
>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem making a request to the
>> leader:org.apache.solr.common.SolrException: I was asked to wait on state
>> down for <001>:8080_solr but I still do not see the request state. I see
>> state: active live:true
>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem making a request to the
>> leader:org.apache.solr.common.SolrException: I was asked to wait on state
>> down for <001>:8080_solr but I still do not see the request state. I see
>> state: active live:true
>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> ....
>> ....
>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>> ....
>> ....
>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
>> finding the leader in zk
>>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>> SEVERE: Error getting leader from zk
>> org.apache.solr.common.SolrException: There is conflicting information
>> about the leader of shard: shard1 our state says:
>> http://<001>:8080/solr/core1/ but zookeeper says: http://<003>:8080/solr/core1/
>>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>>         at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>> INFO: Running the leader process.
>> ....
>> ....
>>
>> Thanks for your inputs.
>> Sudhakar.
>>
>>
>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Yes - it means that 001 went down (or, more likely, had its connection to
>>> ZooKeeper interrupted - that's what I mean about a session timeout: if the
>>> solr->zk link is broken for longer than the session timeout, that will
>>> trigger a leader election, and when the connection is reestablished, the
>>> node will have to recover). That waiting should stop as soon as 001 came
>>> back up or reconnected to ZooKeeper.
>>>
>>> In fact, this waiting should not happen in this case - it should only happen
>>> on cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>>> soon!):
>>>
>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code path
>>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>>
>>> - Mark
>>>
>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>
>>> > Yep, after restarting, the cluster came back to a normal state. We will run
>>> > a couple more tests and see if we can reproduce this issue.
>>> >
>>> > Btw, I am attaching the server logs from before that 'INFO: Waiting until
>>> > we see more replicas' message. From the logs, we can see that the leader
>>> > election process started on 003, which was initially the replica for 001.
>>> > Does that mean leader 001 went down at that time?
>>> >
>>> > logs on 003:
>>> > ========
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> > INFO: Running the leader process.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> > INFO: Checking if I should try and be the leader.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> > INFO: My last published State was Active, it's okay to be the leader.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> > INFO: I may be the new leader - try and sync
>>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>> > WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
>>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>> > INFO: Sync replicas to http://<003>:8080/solr/core1/
>>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
>>> > INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>> > INFO: Updating live nodes -> this message is on 002
>>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>> > WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed
>>> > org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
>>> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> >         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>> >         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>> >         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>> >         at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>> >         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>> >         at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>> >         at java.lang.Thread.run(Unknown Source)
>>> > Caused by: java.net.SocketTimeoutException: Read timed out
>>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>>> >         at java.net.SocketInputStream.read(Unknown Source)
>>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>>> > INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
>>> > 12:11:46 PM org.apache.solr.common.SolrException log
>>> > SEVERE: Sync Failed
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
>>> > INFO: There is a better leader candidate than us - going back into recovery
>>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>> > INFO: Running recovery - first canceling any ongoing recovery
>>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>> > INFO: Starting recovery process. core=core1 recoveringAfterStartup=false
>>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>> > INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
>>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>>> > INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> > INFO: Running the leader process.
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> > INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> > INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
>>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> > INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
>>> > ....
>>> > ....
>>> >
>>> > Thanks for your help.
>>> > Sudhakar.
>>> >
>>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> > The waiting logging had to happen on restart unless it's some kind of bug.
>>> >
>>> > Beyond that, something is off, but I have no clue why - it seems your
>>> > clusterstate.json is not up to date at all.
>>> >
>>> > Have you tried restarting the cluster then? Does that help at all?
>>> >
>>> > Do you see any exceptions around zookeeper session timeouts?
>>> >
>>> > - Mark
>>> >
>>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>> >
>>> > > Hey Mark,
>>> > >
>>> > > Yes, I am able to access all of the nodes under each shard from the
>>> > > SolrCloud admin UI.
>>> > >
>>> > > - It kind of looks like the urls solrcloud is using are not accessible.
>>> > >   When you go to the admin page and the cloud tab, can you access the
>>> > >   urls it shows for each shard? That is, if you click one of the links or
>>> > >   copy and paste the address into a web browser, does it work?
>>> > >
>>> > > Actually, I got these errors while my document upload task/job was running,
>>> > > not during the cluster restart. Also, the job ran fine for the first hour
>>> > > and started throwing these errors after indexing some docs.
>>> > >
>>> > > Thx, Sudhakar.
>>> > >
>>> > >
>>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> > >
>>> > >> It kind of looks like the urls solrcloud is using are not accessible. When
>>> > >> you go to the admin page and the cloud tab, can you access the urls it
>>> > >> shows for each shard? That is, if you click one of the links or copy and
>>> > >> paste the address into a web browser, does it work?
>>> > >>
>>> > >> You may have to explicitly set the host= in solr.xml if it's not auto
>>> > >> detecting the right one. Make sure the ports look right too.
>>> > >>
>>> > >>> waitForReplicasToComeUp
>>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>>> > >>> timeoutin=179999
>>> > >>
>>> > >> That happens when you stop the cluster and try to start it again - before
>>> > >> a leader is chosen, it will wait for all known replicas for a shard to come
>>> > >> up so that everyone can sync up and have a chance to be the best leader. So
>>> > >> at this point it was only finding one of 2 known replicas and waiting for
>>> > >> the second to come up. After a couple of minutes (configurable) it will just
>>> > >> continue anyway without the missing replica (if it doesn't show up).
>>> > >>
>>> > >> - Mark
>>> > >>
>>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>> > >>
>>> > >>> Hi,
>>> > >>> We are uploading Solr documents to the index in batches using 30 threads,
>>> > >>> with a ThreadPoolExecutor and a LinkedBlockingQueue whose max limit is set
>>> > >>> to 10000.
>>> > >>> In the code, we use HttpSolrServer and its add(inputDoc) method to add
>>> > >>> docs.
>>> > >>> And we have the following commit settings in solrconfig:
>>> > >>>
>>> > >>> <autoCommit>
>>> > >>>   <maxTime>300000</maxTime>
>>> > >>>   <maxDocs>10000</maxDocs>
>>> > >>>   <openSearcher>false</openSearcher>
>>> > >>> </autoCommit>
>>> > >>>
>>> > >>> <autoSoftCommit>
>>> > >>>   <maxTime>1000</maxTime>
>>> > >>> </autoSoftCommit>
>>> > >>>
>>> > >>> Cluster Details:
>>> > >>> ----------------------------
>>> > >>> solr version - 4.0
>>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>> > >>> numShards=2
>>> > >>> 001, 002, 003 are the solr nodes and all three are behind the
>>> > >>> loadbalancer <vip>
>>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
>>> > >>>
>>> > >>>
>>> > >>> Logs: getting the errors in the below sequence after uploading some docs:
>>> > >>> -----------------------------------------------------------------------------------------------------------
>>> > >>> 003
>>> > >>> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> > >>> waitForReplicasToComeUp
>>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>>> > >>> timeoutin=179999
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:12:59 PM
>>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor doDefensiveChecks
>>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't think so
>>> > >>>
>>> > >>> 003
>>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
>>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error while trying to recover.
>>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the leader
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException:
>>> > >>> IOException occured when talking to server at <vip>/solr/core1
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>> > >>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> > >>>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>> > >>>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>> > >>>         ... 5 lines omitted ...
>>> > >>>         at java.lang.Thread.run(Unknown Source)
>>> > >>> Caused by: java.net.SocketException: Connection reset
>>> > >>>
>>> > >>>
>>> > >>> After some time, all three servers go down.
>>> > >>>
>>> > >>> We'd appreciate it if someone could let us know what we are missing.
>>> > >>>
>>> > >>> Thx, Sudhakar.
>>> > >>
>>> > >>
>>> >
>>> > <logs_error.txt>
>>>
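For completeness, here is a rough, self-contained sketch of the kind of upload job described at the start of this thread (30 worker threads over a bounded LinkedBlockingQueue, adding documents through HttpSolrServer and leaving commits to autoCommit/autoSoftCommit). This is only illustrative, assuming SolrJ 4.x; the URL, field names, document count, and error handling are placeholders, not the actual job:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class UploadJobSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL standing in for the load-balancer VIP in front of 001/002/003.
            final HttpSolrServer server = new HttpSolrServer("http://vip:8080/solr/core1");

            // 30 worker threads fed from a bounded queue of 10000, as in the setup above.
            ExecutorService pool = new ThreadPoolExecutor(30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 100000; i++) {
                final String id = Integer.toString(i);
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", id);              // field names are illustrative only
                            doc.addField("title_t", "doc " + id);
                            server.add(doc);                     // commits left to autoCommit/autoSoftCommit
                        } catch (Exception e) {
                            // A real job needs retry/backoff here; the 503s discussed in this
                            // thread surface as exceptions thrown by add().
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            server.shutdown();
        }
    }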