Re: SolrCloud: ClusterState says we are the leader but locally we don't think so

2017-01-17 Thread Kelly, Frank
We bounced the ZooKeeper nodes one by one, but there was no change.

Since this is our Prod server (100M+ docs) we don't want to have to
reindex from scratch (it takes 7+ days), so we're considering editing
/collections//state.json via zkcli.sh.
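
Something along these lines (an untested sketch; zk1:2181,zk2:2181,zk3:2181
stands in for our ZooKeeper ensemble and the collection name is taken from the
error below):

  # pull the current collection state out of ZooKeeper
  server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
    -cmd getfile /collections/prod_us-east-1_here_account/state.json state.json

  # edit state.json locally, then write it back
  server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
    -cmd putfile /collections/prod_us-east-1_here_account/state.json state.json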

Thoughts?

-Frank

 


On 1/17/17, 5:49 PM, "Pushkar Raste"  wrote:

>Try bouncing the overseer for your cluster.
>
>On Jan 17, 2017 12:01 PM, "Kelly, Frank"  wrote:
>
>> Solr Version: 5.3.1
>>
>> Configuration: 3 shards, 3 replicas each
>>
>> After running out of heap memory recently (cause unknown) we've been
>> successfully restarting nodes to recover.
>>
>> Finally we did one restart and one of the nodes now says the following
>> 2017-01-17 16:57:16.835 ERROR (qtp1395089624-17)
>> [c:prod_us-east-1_here_account s:shard3 r:core_node26
>> x:prod_us-east-1_here_account_shard3_replica3] o.a.s.c.SolrCore
>> org.apache.solr.common.SolrException: ClusterState says we are the leader
>> (http://10.255.6.196:8983/solr/prod_us-east-1_here_account_shard3_replica3),
>> but locally we don't think so. Request came from null
>>
>> How can we recover from this (for Solr 5.3.1)?
>> Is there some way to force a new leader? (I know the following feature
>> exists, but only in 5.4.0: https://issues.apache.org/jira/browse/SOLR-7569)
>>
>> Thanks!
>>
>> -Frank
>>
>> *Frank Kelly*
>>
>> *Principal Software Engineer*
>>
>>
>>
>> HERE
>>
>> 5 Wayside Rd, Burlington, MA 01803, USA
>>
>> *42° 29' 7" N 71° 11' 32" W*
>>
>>
>>



Re: SolrCloud: ClusterState says we are the leader but locally we don't think so

2017-01-17 Thread Pushkar Raste
Try bouncing the overseer for your cluster.
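
You can see which node currently holds the overseer role with the Collections
API's OVERSEERSTATUS action (a sketch, assuming Solr listens on 8983; the
"leader" field in the response should name the overseer node):

  curl 'http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json'

Restarting that Solr instance forces a new overseer election.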

On Jan 17, 2017 12:01 PM, "Kelly, Frank"  wrote:

> Solr Version: 5.3.1
>
> Configuration: 3 shards, 3 replicas each
>
> After running out of heap memory recently (cause unknown) we’ve been
> successfully restarting nodes to recover.
>
> Finally we did one restart and one of the nodes now says the following
> 2017-01-17 16:57:16.835 ERROR (qtp1395089624-17)
> [c:prod_us-east-1_here_account s:shard3 r:core_node26
> x:prod_us-east-1_here_account_shard3_replica3] o.a.s.c.SolrCore
> org.apache.solr.common.SolrException: ClusterState says we are the leader
> (http://10.255.6.196:8983/solr/prod_us-east-1_here_account_shard3_replica3),
> but locally we don't think so. Request came from null
>
> How can we recover from this (for Solr 5.3.1)?
> Is there some way to force a new leader (I know the following feature
> exists but in 5.4.0 https://issues.apache.org/jira/browse/SOLR-7569)
>
> Thanks!
>
> -Frank
>
>
>
>
> *Frank Kelly*
>
> *Principal Software Engineer*
>
>
>
> HERE
>
> 5 Wayside Rd, Burlington, MA 01803, USA
>
> *42° 29' 7" N 71° 11' 32" W*
>
>
>


SolrCloud: ClusterState says we are the leader but locally we don't think so

2017-01-17 Thread Kelly, Frank
Solr Version: 5.3.1

Configuration: 3 shards, 3 replicas each

After running out of heap memory recently (cause unknown) we've been 
successfully restarting nodes to recover.

Finally we did one restart and one of the nodes now says the following
2017-01-17 16:57:16.835 ERROR (qtp1395089624-17) [c:prod_us-east-1_here_account 
s:shard3 r:core_node26 x:prod_us-east-1_here_account_shard3_replica3] 
o.a.s.c.SolrCore org.apache.solr.common.SolrException: ClusterState says we are 
the leader 
(http://10.255.6.196:8983/solr/prod_us-east-1_here_account_shard3_replica3), 
but locally we don't think so. Request came from null

How can we recover from this (for Solr 5.3.1)?
Is there some way to force a new leader (I know the following feature exists but 
in 5.4.0 https://issues.apache.org/jira/browse/SOLR-7569)
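
(For reference, the SOLR-7569 change is the FORCELEADER action of the
Collections API, which only landed in 5.4, so it isn't an option on 5.3.1.
The call would look something like the sketch below, using the collection and
shard from the log above and a placeholder localhost:8983 for the Solr host:

  curl 'http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=prod_us-east-1_here_account&shard=shard3'
)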

Thanks!

-Frank




Frank Kelly

Principal Software Engineer



HERE

5 Wayside Rd, Burlington, MA 01803, USA

42° 29' 7" N 71° 11' 32" W



Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-03 Thread Marcin Rzewucki
Hi,

I think the issue was not in zk client timeout, but POST request size. When
I increased the value for Request.maxFormContentSize in jetty.xml I don't
see this issue any more.

Regards.

On 3 February 2013 01:56, Mark Miller markrmil...@gmail.com wrote:

 Do you see anything about session expiration in the logs? That is the
 likely culprit for something like this. You may need to raise the timeout:
 http://wiki.apache.org/solr/SolrCloud#FAQ

 If you see no session timeouts, I don't have a guess yet.

 - Mark

 On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:

  I'm experiencing same problem in Solr4.1 during bulk loading. After 50
  minutes of indexing the following error starts to occur:
 
  INFO: [core] webapp=/solr path=/update params={} {} 0 4
  Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
 the
  leader, but locally we don't think so
 at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
 at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
 at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
 at
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
 at
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
 at
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
 at
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
 at
  org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
 at
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
 at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
 at
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at
  org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
 at
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
 at
 org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
 at
  org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
 at
 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Unknown Source)
  Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
  Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
  

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-03 Thread Mark Miller
What led you to trying that? I'm not connecting the dots in my head - the 
exception and the solution.

- Mark

On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:

 Hi,
 
 I think the issue was not in zk client timeout, but POST request size. When
 I increased the value for Request.maxFormContentSize in jetty.xml I don't
 see this issue any more.
 
 Regards.
 
 On 3 February 2013 01:56, Mark Miller markrmil...@gmail.com wrote:
 
 Do you see anything about session expiration in the logs? That is the
 likely culprit for something like this. You may need to raise the timeout:
 http://wiki.apache.org/solr/SolrCloud#FAQ
 
 If you see no session timeouts, I don't have a guess yet.
 
 - Mark
 
 On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:
 
 I'm experiencing same problem in Solr4.1 during bulk loading. After 50
 minutes of indexing the following error starts to occur:
 
 INFO: [core] webapp=/solr path=/update params={} {} 0 4
 Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
 the
 leader, but locally we don't think so
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
   at
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
   at
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
   at
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
   at
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
   at
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
   at
 org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
   at
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
   at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
   at
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
   at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
   at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
   at
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
   at
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
   at
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
   at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
   at
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
   at
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
   at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
   at
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
   at
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
   at
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
   at org.eclipse.jetty.server.Server.handle(Server.java:365)
   at
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
   at
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
   at
 
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
   at
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
   at
 org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
   at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
   at
 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
   at
 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
   at
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
   at
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
   at java.lang.Thread.run(Unknown Source)
 Feb 02, 2013 11:36:15 PM 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-03 Thread Marcin Rzewucki
I'm loading in batches. 10 threads are reading JSON files and loading them into
Solr by sending POST requests (from a couple of dozen to a couple of hundred docs
per request). I had a 1MB POST request size limit, but when I changed it to 10MB
the errors disappeared. I guess this could be the reason.
Regards.


On 3 February 2013 20:55, Mark Miller markrmil...@gmail.com wrote:

 What led you to trying that? I'm not connecting the dots in my head - the
 exception and the solution.

 - Mark

 On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:

  Hi,
 
  I think the issue was not in zk client timeout, but POST request size.
 When
  I increased the value for Request.maxFormContentSize in jetty.xml I don't
  see this issue any more.
 
  Regards.
 
  On 3 February 2013 01:56, Mark Miller markrmil...@gmail.com wrote:
 
  Do you see anything about session expiration in the logs? That is the
  likely culprit for something like this. You may need to raise the
 timeout:
  http://wiki.apache.org/solr/SolrCloud#FAQ
 
  If you see no session timeouts, I don't have a guess yet.
 
  - Mark
 
  On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki mrzewu...@gmail.com
 wrote:
 
  I'm experiencing same problem in Solr4.1 during bulk loading. After 50
  minutes of indexing the following error starts to occur:
 
  INFO: [core] webapp=/solr path=/update params={} {} 0 4
  Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
  the
  leader, but locally we don't think so
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
at
 
 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
at
 
 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at
 
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
at
 
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
at
 
 
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
at
  org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
at
 
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
 
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
at
 
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at
 
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
 
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at
 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
 
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
 
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
 
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at
  org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at
  org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
 
 
 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-03 Thread Shawn Heisey

On 2/3/2013 1:07 PM, Marcin Rzewucki wrote:

I'm loading in batches. 10 threads are reading json files and load to Solr
by sending POST request (from couple of dozens to couple of hundreds docs
in 1 request). I had 1MB post request size, but when I changed it to 10MB
errors disappeared. I guess this could be the reason.
Regards.


I thought SOLR-4265 changed the whole way that Solr interacts with Jetty 
and set the max form size to 2MB within Solr.  My reading says that you 
can now control the max POST size within solrconfig.xml - look for 
formdataUploadLimitInKB within the example solrconfig.


I'm curious as to exactly how you changed the max size in jetty.xml. The 
typical way that you set maxFormContentSize is broken, but jetty 8.1.9 
will fix that.  Jetty 8.1.8 is what is currently included with Solr 4.1. 
 See the following Jetty bug:


https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn



Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-03 Thread Marcin Rzewucki
Hi,

I set this:
<Call name="setAttribute">
  <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
  <Arg>10485760</Arg>
</Call>
multipartUploadLimitInKB is set to 2MB in my case. The funny thing is that I made
the change only in jetty.xml.
I'll change this value back to 1MB and repeat the test to check whether this is the
reason.
Regards.

On 3 February 2013 21:26, Shawn Heisey s...@elyograg.org wrote:

 On 2/3/2013 1:07 PM, Marcin Rzewucki wrote:

 I'm loading in batches. 10 threads are reading json files and load to Solr
 by sending POST request (from couple of dozens to couple of hundreds docs
 in 1 request). I had 1MB post request size, but when I changed it to 10MB
 errors disappeared. I guess this could be the reason.
 Regards.


 I thought SOLR-4265 changed the whole way that Solr interacts with Jetty
 and set the max form size to 2MB within Solr.  My reading says that you can
 now control the max POST size within solrconfig.xml - look for
 formdataUploadLimitInKB within the example solrconfig.

 I'm curious as to exactly how you changed the max size in jetty.xml. The
 typical way that you set maxFormContentSize is broken, but jetty 8.1.9 will
 fix that.  Jetty 8.1.8 is what is currently included with Solr 4.1.  See
 the following Jetty bug:

 https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

 Thanks,
 Shawn




Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-03 Thread Marcin Rzewucki
Hi,
I think I made 2 changes at the same time: I increased maxFormContentSize and
zkClientTimeout (from 15s to 30s). When I restarted the cluster there were no
ClusterState issues, most probably due to the increased zkClientTimeout
and not maxFormContentSize. I did one more test with the default value
for maxFormContentSize (1MB) and there were no issues either.

Regards.

On 3 February 2013 22:16, Marcin Rzewucki mrzewu...@gmail.com wrote:

 Hi,

 I set this:
 <Call name="setAttribute">
   <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
   <Arg>10485760</Arg>
 </Call>
 multipartUploadLimitInKB is set to 2MB in my case. The funny is that I did
 change only in jetty.xml
 I'll change this value back to 1MB and repeat test to check if this is the
 reason.
 Regards.


 On 3 February 2013 21:26, Shawn Heisey s...@elyograg.org wrote:

 On 2/3/2013 1:07 PM, Marcin Rzewucki wrote:

 I'm loading in batches. 10 threads are reading json files and load to
 Solr
 by sending POST request (from couple of dozens to couple of hundreds docs
 in 1 request). I had 1MB post request size, but when I changed it to 10MB
 errors disappeared. I guess this could be the reason.
 Regards.


 I thought SOLR-4265 changed the whole way that Solr interacts with Jetty
 and set the max form size to 2MB within Solr.  My reading says that you can
 now control the max POST size within solrconfig.xml - look for
 formdataUploadLimitInKB within the example solrconfig.

 I'm curious as to exactly how you changed the max size in jetty.xml. The
 typical way that you set maxFormContentSize is broken, but jetty 8.1.9 will
 fix that.  Jetty 8.1.8 is what is currently included with Solr 4.1.  See
 the following Jetty bug:

 https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

 Thanks,
 Shawn





Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-02 Thread Marcin Rzewucki
I'm experiencing same problem in Solr4.1 during bulk loading. After 50
minutes of indexing the following error starts to occur:

INFO: [core] webapp=/solr path=/update params={} {} 0 4
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
leader, but locally we don't think so
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
at
org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699

Then the leader tries to sync with the replica, and after it finishes I can continue
loading.
None of the SolrCloud nodes was restarted during that time. I don't remember
such behaviour in Solr 4.0. Could it be related to the number of fields
indexed during loading? I have a collection with about 2400 fields. I
can't reproduce the same issue for other collections with far fewer fields per
record.
Regards.

On 11 December 2012 19:50, Sudhakar Maddineni maddineni...@gmail.com wrote:

 Just an update on this issue:
 We tried increasing the zookeeper client timeout setting to 30000ms in
 solr.xml (i think default is 15000ms), and haven't seen any issues from our
 tests.
 cores 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2013-02-02 Thread Mark Miller
Do you see anything about session expiration in the logs? That is the likely 
culprit for something like this. You may need to raise the timeout: 
http://wiki.apache.org/solr/SolrCloud#FAQ
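
A quick way to check is to grep the Solr log for expirations - something like
the following, with the log path being a placeholder for wherever your
container writes it:

  grep -iE 'session.*(expired|timeout)' /path/to/solr/logs/solr.log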

If you see no session timeouts, I don't have a guess yet.

- Mark

On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:

 I'm experiencing same problem in Solr4.1 during bulk loading. After 50
 minutes of indexing the following error starts to occur:
 
 INFO: [core] webapp=/solr path=/update params={} {} 0 4
 Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
 leader, but locally we don't think so
at
 org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
at
 org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
at
 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
at
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
at
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
at
 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
at
 org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
at
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
 Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
 Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699
 
 Then leader tries to sync with replica and after it finishes I can continue
 loading.
 None of SolrCloud nodes was restarted during that time. I don't remember
 such behaviour in Solr4.0. Could it be related with the number of fields
 indexed during loading ? I have a collection with about 

Re: Solr4 SolrCloud ClusterState says we are the leader, but locally we don't think so

2013-01-25 Thread John Skopis (lists)
Actually I was mistaken. I thought we were running 4.1.0 but we were
actually running 4.0.0.

I will upgrade to 4.1.0 and see if this is still happening.

Thanks,
John

On Wed, Jan 23, 2013 at 9:39 PM, John Skopis (lists) jli...@skopis.com wrote:

 Sorry for leaving that bit out. This is Solr 4.1.0.

 Thanks again,
 John

 On Wed, Jan 23, 2013 at 5:39 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

 Hi,

 Solr4 is 4.0 or 4.1? If the former try the latter first?

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Jan 23, 2013 2:51 PM, John Skopis (lists) jli...@skopis.com wrote:

  Hello,
 
  We have recently put solr4 into production.
 
  We have a 3 node cluster with a single shard. Each solr node is also a
  zookeeper node, but zookeeper is running in cluster mode. We are using
 the
  cloudera zookeeper package.
 
  There is no communication problems between nodes. They are in two
  different racks directly connected over a 2Gb uplink. The nodes each
 have a
  1Gb uplink.
 
  I was thinking ideally mmsolr01 would be the leader, the application
 sends
  all index requests directly to the leader node. A load balancer splits
 read
  requests over the remaining two nodes.
 
  We autocommit every 300s or 10k documents with a softcommit every 5s.
 The
  index is roughly 200mm documents.
 
  I have configured a cron to run every hour (on every node):
  0 * * * * /usr/bin/curl -s '
 
  http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3
  '
   > /dev/null
 
  Using a snapshot seems to be the easiest way to reproduce, but it's also
  possible to reproduce under very heavy indexing load.
 
  When the snapshot is running, occasionally we get a zk timeout, causing
  the leader to drop out of the cluster. We have also seen a few zk
 timeouts
  when index load is very high.
 
  After the failure it can take the now inconsistent node a few hours to
  recover. After numerous failed recovery attempts the failed node seems
 to
  sync up.
 
  I have attached a log file demonstrating this.
 
  We see lots of timeout requests, seemingly when the failed node tries to
  sync up with the current leader by doing a full sync. This seems wrong,
  there should be no reason for a timeout to happen here?
 
  I am able to manually copy the index using tar + netcat in a few
 minutes.
  The replication handler takes
 
  INFO: Total time taken for download : 3549 secs
 
  Why does it take so long to recover?
 
  Are we better off manually replicating the index?
 
  Much appreciated,
  Thanks,
  John
 
 
 
 
 
 
 
 





Solr4 SolrCloud ClusterState says we are the leader, but locally we don't think so

2013-01-23 Thread John Skopis (lists)
Hello,

We have recently put solr4 into production.

We have a 3 node cluster with a single shard. Each solr node is also a
zookeeper node, but zookeeper is running in cluster mode. We are using the
cloudera zookeeper package.

There are no communication problems between the nodes. They are in two different
racks directly connected over a 2Gb uplink. The nodes each have a 1Gb
uplink.

I was thinking that ideally mmsolr01 would be the leader; the application sends
all index requests directly to the leader node. A load balancer splits read
requests over the remaining two nodes.

We autocommit every 300s or 10k documents with a softcommit every 5s. The
index is roughly 200mm documents.

I have configured a cron to run every hour (on every node):
0 * * * * /usr/bin/curl -s '
http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3'
 > /dev/null

Using a snapshot seems to be the easiest way to reproduce, but it's also
possible to reproduce under very heavy indexing load.

When the snapshot is running, occasionally we get a zk timeout, causing the
leader to drop out of the cluster. We have also seen a few zk timeouts when
index load is very high.

After the failure it can take the now inconsistent node a few hours to
recover. After numerous failed recovery attempts the failed node seems to
sync up.

I have attached a log file demonstrating this.

We see lots of timeout requests, seemingly when the failed node tries to
sync up with the current leader by doing a full sync. This seems wrong,
there should be no reason for a timeout to happen here?

I am able to manually copy the index using tar + netcat in a few minutes.
The replication handler takes

INFO: Total time taken for download : 3549 secs

Why does it take so long to recover?

Are we better off manually replicating the index?
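
(Recovery progress for the full index fetch can at least be watched through the
replication handler's details command - a sketch, assuming the core name
collection1 from the cron job above:

  curl 'http://localhost:8983/solr/collection1/replication?command=details&wt=json'
)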

Much appreciated,
Thanks,
John


sample.txt.gz
Description: GNU Zip compressed data


Re: Solr4 SolrCloud ClusterState says we are the leader, but locally we don't think so

2013-01-23 Thread Otis Gospodnetic
Hi,

Solr4 is 4.0 or 4.1? If the former try the latter first?

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 23, 2013 2:51 PM, John Skopis (lists) jli...@skopis.com wrote:

 Hello,

 We have recently put solr4 into production.

 We have a 3 node cluster with a single shard. Each solr node is also a
 zookeeper node, but zookeeper is running in cluster mode. We are using the
 cloudera zookeeper package.

 There is no communication problems between nodes. They are in two
 different racks directly connected over a 2Gb uplink. The nodes each have a
 1Gb uplink.

 I was thinking ideally mmsolr01 would be the leader, the application sends
 all index requests directly to the leader node. A load balancer splits read
 requests over the remaining two nodes.

 We autocommit every 300s or 10k documents with a softcommit every 5s. The
 index is roughly 200mm documents.

 I have configured a cron to run every hour (on every node):
 0 * * * * /usr/bin/curl -s '
 http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3'
  > /dev/null

 Using a snapshot seems to be the easiest way to reproduce, but it's also
 possible to reproduce under very heavy indexing load.

 When the snapshot is running, occasionally we get a zk timeout, causing
 the leader to drop out of the cluster. We have also seen a few zk timeouts
 when index load is very high.

 After the failure it can take the now inconsistent node a few hours to
 recover. After numerous failed recovery attempts the failed node seems to
 sync up.

 I have attached a log file demonstrating this.

 We see lots of timeout requests, seemingly when the failed node tries to
 sync up with the current leader by doing a full sync. This seems wrong,
 there should be no reason for a timeout to happen here?

 I am able to manually copy the index using tar + netcat in a few minutes.
 The replication handler takes

 INFO: Total time taken for download : 3549 secs

 Why does it take so long to recover?

 Are we better off manually replicating the index?

 Much appreciated,
 Thanks,
 John










Re: Solr4 SolrCloud ClusterState says we are the leader, but locally we don't think so

2013-01-23 Thread John Skopis (lists)
Sorry for leaving that bit out. This is Solr 4.1.0.

Thanks again,
John

On Wed, Jan 23, 2013 at 5:39 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 Solr4 is 4.0 or 4.1? If the former try the latter first?

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Jan 23, 2013 2:51 PM, John Skopis (lists) jli...@skopis.com wrote:

  Hello,
 
  We have recently put solr4 into production.
 
  We have a 3 node cluster with a single shard. Each solr node is also a
  zookeeper node, but zookeeper is running in cluster mode. We are using
 the
  cloudera zookeeper package.
 
  There is no communication problems between nodes. They are in two
  different racks directly connected over a 2Gb uplink. The nodes each
 have a
  1Gb uplink.
 
  I was thinking ideally mmsolr01 would be the leader, the application
 sends
  all index requests directly to the leader node. A load balancer splits
 read
  requests over the remaining two nodes.
 
  We autocommit every 300s or 10k documents with a softcommit every 5s. The
  index is roughly 200mm documents.
 
  I have configured a cron to run every hour (on every node):
  0 * * * * /usr/bin/curl -s '
 
 http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3
 '
   > /dev/null
 
  Using a snapshot seems to be the easiest way to reproduce, but it's also
  possible to reproduce under very heavy indexing load.
 
  When the snapshot is running, occasionally we get a zk timeout, causing
  the leader to drop out of the cluster. We have also seen a few zk
 timeouts
  when index load is very high.
 
  After the failure it can take the now inconsistent node a few hours to
  recover. After numerous failed recovery attempts the failed node seems to
  sync up.
 
  I have attached a log file demonstrating this.
 
  We see lots of timeout requests, seemingly when the failed node tries to
  sync up with the current leader by doing a full sync. This seems wrong,
  there should be no reason for a timeout to happen here?
 
  I am able to manually copy the index using tar + netcat in a few minutes.
  The replication handler takes
 
  INFO: Total time taken for download : 3549 secs
 
  Why does it take so long to recover?
 
  Are we better off manually replicating the index?
 
  Much appreciated,
  Thanks,
  John
 
 
 
 
 
 
 
 



Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-11 Thread Sudhakar Maddineni
Just an update on this issue:
   We tried increasing the zookeeper client timeout setting to 30000ms in
solr.xml (I think the default is 15000ms), and haven't seen any issues from our
tests:
<cores ... zkClientTimeout="30000" ... >
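
With the stock example solr.xml the same value can also be passed at startup,
since the cores element there reads the zkClientTimeout system property - a
sketch, with the zkHost list as a placeholder for your ensemble:

  java -DzkClientTimeout=30000 -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar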

Thanks, Sudhakar.

On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
maddineni...@gmail.com wrote:

 We saw this error again today during our load test - basically, whenever
 session is getting expired on the leader node, we are seeing the
 error.After this happens, leader(001) is going into 'recovery' mode and all
 the index updates are failing with 503- service unavailable error
 message.After some time(once recovery is successful), roles are swapped
 i.e. 001 acting as the replica and 003 as leader.

 Btw, do you know why the connection to zookeeper[solr-zk] getting
 interrupted in the middle?
 is it because of the load(no of updates) we are putting on the cluster?

 Here is the exception stack trace:

 Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
 WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
   at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
   at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
   at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
   at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
   at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
   at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
   at java.lang.Thread.run(Unknown Source)

 Thx,Sudhakar.



 On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni maddineni...@gmail.com
  wrote:

 Erick:
   Not seeing any page caching related issues...

 Mark:
   1.Would this waiting on 003(replica) cause any inconsistencies in the
 zookeeper cluster state? I was also looking at the leader(001) logs at that
 time and seeing errors related to *SEVERE: ClusterState says we are the
 leader, but locally we don't think so*.
   2.Also, all of our servers in cluster were gone down when the index
 updates were running in parallel along with this issue.Do you see this
 related to the session expiry on 001?

 Here are the logs on 001
 =

 Dec 4, 2012 12:12:29 PM
 org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
 WARNING:
 org.apache.zookeeper.KeeperException$SessionExpiredException:
 KeeperErrorCode = Session expired for /overseer_elect/leader
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
 Dec 4, 2012 12:12:29 PM
 org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
 INFO: According to ZK I
 (id=232887758696546307-001:8080_solr-n_05) am no longer a leader.

 Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
 run
 WARNING: Overseer cannot talk to ZK

 Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem finding the leader in
 zk:org.apache.solr.common.SolrException: Could not get leader props
  at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
 at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem finding the leader in
 zk:org.apache.solr.common.SolrException: Could not get leader props
  at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
 at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem making a request to the
 leader:org.apache.solr.common.SolrException: I was asked to wait on state
 down for 001:8080_solr but I still do not see the request state. I see
 state: active live:true
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem making a request to the
 leader:org.apache.solr.common.SolrException: I was asked to wait on state
 down for 001:8080_solr but I still do not see the request state. I see
 state: active live:true
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
  
 Dec 4, 2012 12:19:10 PM 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-07 Thread Sudhakar Maddineni
Erick:
  Not seeing any page caching related issues...

Mark:
  1.Would this waiting on 003(replica) cause any inconsistencies in the
zookeeper cluster state? I was also looking at the leader(001) logs at that
time and seeing errors related to *SEVERE: ClusterState says we are the
leader, but locally we don't think so*.
  2.Also, all of our servers in cluster were gone down when the index
updates were running in parallel along with this issue.Do you see this
related to the session expiry on 001?

Here are the logs on 001
=

Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
amILeader
INFO: According to ZK I
(id=232887758696546307-001:8080_solr-n_05) am no longer a leader.

Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
run
WARNING: Overseer cannot talk to ZK

Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem making a request to the
leader:org.apache.solr.common.SolrException: I was asked to wait on state
down for 001:8080_solr but I still do not see the request state. I see
state: active live:true
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem making a request to the
leader:org.apache.solr.common.SolrException: I was asked to wait on state
down for 001:8080_solr but I still do not see the request state. I see
state: active live:true
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)


Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)


Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
SEVERE: :org.apache.solr.common.SolrException: There was a problem finding
the leader in zk
at
org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
at
org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
SEVERE: Error getting leader from zk
org.apache.solr.common.SolrException: There is conflicting information
about the leader of shard: shard1 our state says: http://001:8080/solr/core1/
but zookeeper says: http://003:8080/solr/core1/
 at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
 at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.



Thanks for your inputs.
Sudhakar.







On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller markrmil...@gmail.com wrote:

 Yes - it means that 001 went down (or more likely had its connection to
 ZooKeeper interrupted? that's what I mean about a session timeout - if the
 solr-zk link is broken for longer than the session timeout that will
 trigger a leader election and when the connection is reestablished, the
 node will have to recover). That waiting should stop as soon as 001 came
 back up or reconnected to ZooKeeper.

 In fact, this waiting should not happen in this case - but only on cluster
 restart. This is a bug that is fixed in 4.1 (hopefully coming very soon!):

 * SOLR-3940: Rejoining the leader election incorrectly triggers the code
 path
   for a fresh cluster start rather than fail over. (Mark Miller)

 - Mark

 On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni maddineni...@gmail.com
 wrote:

  Yep, after restarting, cluster came back to normal state.We will run
 couple of more tests and see if we could reproduce this issue.
 
  Btw, I am attaching the server logs before that 'INFO: Waiting 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-07 Thread Sudhakar Maddineni
We saw this error again today during our load test - basically, whenever the
session gets expired on the leader node, we see the error. After this happens,
the leader (001) goes into 'recovery' mode and all the index updates fail with
a 503 - service unavailable error message. After some time (once recovery is
successful), the roles are swapped, i.e. 001 acts as the replica and 003 as
the leader.

Btw, do you know why the connection to zookeeper [solr-zk] is getting
interrupted in the middle?
Is it because of the load (number of updates) we are putting on the cluster?

Here is the exception stack trace:

Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
    at java.lang.Thread.run(Unknown Source)

Thx, Sudhakar.



On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni
maddineni...@gmail.com wrote:

 Erick:
   Not seeing any page caching related issues...

 Mark:
   1. Would this waiting on 003 (the replica) cause any inconsistencies in the
 ZooKeeper cluster state? I was also looking at the leader (001) logs at that
 time and saw errors like "SEVERE: ClusterState says we are the
 leader, but locally we don't think so".
   2. Also, all of our servers in the cluster went down while the index
 updates were running in parallel with this issue. Do you see this as
 related to the session expiry on 001?

 Here are the logs on 001
 =

 Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
 amILeader
 WARNING:
 org.apache.zookeeper.KeeperException$SessionExpiredException:
 KeeperErrorCode = Session expired for /overseer_elect/leader
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
 Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
 amILeader
 INFO: According to ZK I
 (id=232887758696546307-001:8080_solr-n_05) am no longer a leader.

 Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
 run
 WARNING: Overseer cannot talk to ZK

 Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem finding the leader in
 zk:org.apache.solr.common.SolrException: Could not get leader props
  at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
 at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem finding the leader in
 zk:org.apache.solr.common.SolrException: Could not get leader props
  at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
 at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem making a request to the
 leader:org.apache.solr.common.SolrException: I was asked to wait on state
 down for 001:8080_solr but I still do not see the request state. I see
 state: active live:true
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem making a request to the
 leader:org.apache.solr.common.SolrException: I was asked to wait on state
 down for 001:8080_solr but I still do not see the request state. I see
 state: active live:true
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
  
 Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
 SEVERE: There was a problem finding the leader in
 zk:org.apache.solr.common.SolrException: Could not get leader props
  at
 org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
 
  
 Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
 SEVERE: :org.apache.solr.common.SolrException: There was a problem 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-06 Thread Erick Erickson
I've seen the Waiting until we see... message as well; it seems to me to be
an artifact of bouncing servers rapidly. It took a lot of patience to
wait until the timeoutin value got all the way to 0, but when it did the
system recovered.

As to your original problem, are you possibly getting page caching at the
servlet level?

Best
Erick


On Wed, Dec 5, 2012 at 9:41 PM, Sudhakar Maddineni
maddineni...@gmail.com wrote:

 Yep, after restarting, the cluster came back to a normal state. We will run
 a couple more tests and see if we can reproduce this issue.

 Btw, I am attaching the server logs from before that 'INFO: *Waiting until we
 see more replicas*' message. From the logs, we can see that the leader
 election process started on 003, which was initially the replica for 001.
 Does that mean leader 001 went down at that time?

 logs on 003:
 
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
 runLeaderProcess
 INFO: Running the leader process.
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
 shouldIBeLeader
 INFO: Checking if I should try and be the leader.
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
 shouldIBeLeader
 INFO: My last published State was Active, it's okay to be the
 leader.
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
 runLeaderProcess
 INFO: I may be the new leader - try and sync
 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
 WARNING: Stopping recovery for zkNodeName=003:8080_solr_core
 core=core1.
 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
 INFO: Sync replicas to http://003:8080/solr/core1/
 12:11:16 PM org.apache.solr.update.PeerSync sync
 INFO: PeerSync: core=core1 url=http://003:8080/solr START
 replicas=[001:8080/solr/core1/] nUpdates=100
 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
 INFO: Updating live nodes - this message is on 002
 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
 WARNING: PeerSync: core=core1 url=http://003:8080/solr
  exception talking to 001:8080/solr/core1/, failed
 org.apache.solr.client.solrj.SolrServerException: Timeout occured
 while waiting response from server at: 001:8080/solr/core1
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
 at
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
 Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 Caused by: java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(Unknown Source)
 12:11:46 PM org.apache.solr.update.PeerSync sync
 INFO: PeerSync: core=core1 url=http://003:8080/solr DONE. sync
 failed
 12:11:46 PM org.apache.solr.common.SolrException log
 SEVERE: Sync Failed
 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
 rejoinLeaderElection
 INFO: There is a better leader candidate than us - going back into
 recovery
 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
 INFO: Running recovery - first canceling any ongoing recovery
 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
 INFO: Starting recovery process.  core=core1
 recoveringAfterStartup=false
 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
 INFO: Attempting to PeerSync from 001:8080/solr/core1/
 core=core1 - recoveringAfterStartup=false
 12:11:46 PM org.apache.solr.update.PeerSync sync
 INFO: PeerSync: core=core1 url=http://003:8080/solr START
 replicas=[001:8080/solr/core1/] nUpdates=100
 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
 runLeaderProcess
 INFO: Running the leader process.
 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: *Waiting until we see more replicas up: total=2 found=1
 timeoutin=17*
 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: *Waiting until we see more replicas up: total=2 found=1
 timeoutin=179495*
 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: *Waiting until we see 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-06 Thread Mark Miller
Yes - it means that 001 went down (or more likely had its connection to 
ZooKeeper interrupted? that's what I mean about a session timeout - if the 
solr-zk link is broken for longer than the session timeout that will trigger a 
leader election and when the connection is reestablished, the node will have to 
recover). That waiting should stop as soon as 001 came back up or reconnected 
to ZooKeeper.
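
A minimal sketch of the session-timeout knob on the client side, assuming SolrJ 4.x
and an illustrative ZooKeeper connect string; CloudSolrServer is the ZK-aware client,
which the thread above does not use. On the Solr nodes themselves the corresponding
setting is the zkClientTimeout attribute in solr.xml:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class ZkAwareClient {
        public static void main(String[] args) throws Exception {
            // ZK-aware client: follows leader changes instead of a fixed VIP/URL.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("core1");   // collection name is illustrative
            server.setZkClientTimeout(30000);       // ZK session timeout, in ms
            server.setZkConnectTimeout(15000);      // initial connect timeout, in ms
            server.connect();
            server.shutdown();
        }
    }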

In fact, this waiting should not happen in this case - but only on cluster 
restart. This is a bug that is fixed in 4.1 (hopefully coming very soon!):

* SOLR-3940: Rejoining the leader election incorrectly triggers the code path
  for a fresh cluster start rather than fail over. (Mark Miller)

- Mark

On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni maddineni...@gmail.com wrote:

 Yep, after restarting, the cluster came back to a normal state. We will run a couple
 more tests and see if we can reproduce this issue.
 
 Btw, I am attaching the server logs from before that 'INFO: Waiting until we see
 more replicas' message. From the logs, we can see that the leader election
 process started on 003, which was initially the replica for 001. Does that mean
 leader 001 went down at that time?
 
 logs on 003:
 
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
 INFO: Running the leader process.
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
 INFO: Checking if I should try and be the leader.
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
 INFO: My last published State was Active, it's okay to be the leader.
 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
 INFO: I may be the new leader - try and sync
 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
 WARNING: Stopping recovery for zkNodeName=003:8080_solr_core 
 core=core1.
 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
 INFO: Sync replicas to http://003:8080/solr/core1/
 12:11:16 PM org.apache.solr.update.PeerSync sync
 INFO: PeerSync: core=core1 url=http://003:8080/solr START 
 replicas=[001:8080/solr/core1/] nUpdates=100
 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
 INFO: Updating live nodes - this message is on 002
 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
 WARNING: PeerSync: core=core1 url=http://003:8080/solr  exception 
 talking to 001:8080/solr/core1/, failed
 org.apache.solr.client.solrj.SolrServerException: Timeout occured 
 while waiting response from server at: 001:8080/solr/core1
   at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
   at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
   at 
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
   at 
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
   at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
   at java.util.concurrent.FutureTask.run(Unknown Source)
   at java.util.concurrent.Executors$RunnableAdapter.call(Unknown 
 Source)
   at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
   at java.util.concurrent.FutureTask.run(Unknown Source)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
 Source)
   at java.lang.Thread.run(Unknown Source)
 Caused by: java.net.SocketTimeoutException: Read timed out
   at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(Unknown Source)
 12:11:46 PM org.apache.solr.update.PeerSync sync
 INFO: PeerSync: core=core1 url=http://003:8080/solr DONE. sync 
 failed
 12:11:46 PM org.apache.solr.common.SolrException log
 SEVERE: Sync Failed
 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext 
 rejoinLeaderElection
 INFO: There is a better leader candidate than us - going back into 
 recovery
 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
 INFO: Running recovery - first canceling any ongoing recovery
 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
 INFO: Starting recovery process.  core=core1 
 recoveringAfterStartup=false
 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
 INFO: Attempting to PeerSync from 001:8080/solr/core1/ core=core1 - 
 recoveringAfterStartup=false
 12:11:46 PM org.apache.solr.update.PeerSync sync
 INFO: PeerSync: core=core1 url=http://003:8080/solr START 
 replicas=[001:8080/solr/core1/] nUpdates=100
 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
 INFO: Running the leader process.
 12:11:46 PM 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-05 Thread Mark Miller
What Solr version - beta, alpha, 4.0 final, 4X or 5X?

- Mark

On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni maddineni...@gmail.com wrote:

 Hi,
 We are uploading solr documents to the index in batches using 30 threads
 and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
 1.
 In the code, we are using HttpSolrServer and add(inputDoc) method to add
 docx.
 And, we have the following commit settings in solrconfig:
 
  <autoCommit>
    <maxTime>30</maxTime>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
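 
  A rough sketch of the kind of indexing client described above (a fixed pool of 30
  worker threads feeding HttpSolrServer.add, with commits left to the autoCommit and
  autoSoftCommit settings). This is only an illustration - the VIP URL, queue size,
  field names and document count are made up, not taken from the actual job:

      import java.util.concurrent.*;
      import org.apache.solr.client.solrj.SolrServer;
      import org.apache.solr.client.solrj.impl.HttpSolrServer;
      import org.apache.solr.common.SolrInputDocument;

      public class BatchIndexer {
          public static void main(String[] args) throws Exception {
              final SolrServer solr = new HttpSolrServer("http://vip/solr/core1");
              // 30 fixed threads and a bounded queue; CallerRunsPolicy applies
              // back-pressure instead of rejecting work when the queue fills up.
              ExecutorService pool = new ThreadPoolExecutor(
                      30, 30, 0L, TimeUnit.MILLISECONDS,
                      new LinkedBlockingQueue<Runnable>(10000),
                      new ThreadPoolExecutor.CallerRunsPolicy());
              for (int i = 0; i < 100000; i++) {
                  final String id = Integer.toString(i);
                  pool.submit(new Runnable() {
                      public void run() {
                          try {
                              SolrInputDocument doc = new SolrInputDocument();
                              doc.addField("id", id);
                              solr.add(doc);   // no explicit commit; autoCommit handles it
                          } catch (Exception e) {
                              e.printStackTrace();
                          }
                      }
                  });
              }
              pool.shutdown();
              pool.awaitTermination(1, TimeUnit.HOURS);
              solr.shutdown();
          }
      }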
 
 Cluster Details:
 
 solr version - 4.0
 zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
 numshards=2 ,
 001, 002, 003 are the solr nodes and these three are behind the
 loadbalancer  vip
 001, 003 assigned to shard1; 002 assigned to shard2
 
 
 Logs:Getting the errors in the below sequence after uploading some docx:
 ---
 003
 Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: Waiting until we see more replicas up: total=2 found=1
 timeoutin=17
 
 001
 Dec 4, 2012 12:12:59 PM
 org.apache.solr.update.processor.DistributedUpdateProcessor
 doDefensiveChecks
 SEVERE: ClusterState says we are the leader, but locally we don't think so
 
 003
 Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
 SEVERE: forwarding update to 001:8080/solr/core1/ failed - retrying ...
 
 001
 Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
 SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
 vip/solr/core1. returned non ok status:503, message:Service Unavailable
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 001
 Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
 SEVERE: Error while trying to recover.
 core=core1:org.apache.solr.common.SolrException: We are not the leader
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
 001
 Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
 SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at vip/solr/core1
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
 ... 5 lines omitted ...
 at java.lang.Thread.run(Unknown Source)
 Caused by: java.net.SocketException: Connection reset
 
 
 After sometime, all the three servers are going down.
 
 Appreciate, if someone could let us know what we are missing.
 
 Thx,Sudhakar.



Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-05 Thread Sudhakar Maddineni
using solr version - 4.0 final.

Thx, Sudhakar.

On Wed, Dec 5, 2012 at 5:26 PM, Mark Miller markrmil...@gmail.com wrote:

 What Solr version - beta, alpha, 4.0 final, 4X or 5X?

 - Mark

 On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni maddineni...@gmail.com
 wrote:

  Hi,
  We are uploading solr documents to the index in batches using 30 threads
  and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
  1.
  In the code, we are using HttpSolrServer and add(inputDoc) method to add
  docx.
  And, we have the following commit settings in solrconfig:
 
  <autoCommit>
    <maxTime>30</maxTime>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
 
  Cluster Details:
  
  solr version - 4.0
  zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
  numshards=2 ,
  001, 002, 003 are the solr nodes and these three are behind the
  loadbalancer  vip
  001, 003 assigned to shard1; 002 assigned to shard2
 
 
  Logs:Getting the errors in the below sequence after uploading some docx:
 
 ---
  003
  Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
  waitForReplicasToComeUp
  INFO: Waiting until we see more replicas up: total=2 found=1
  timeoutin=17
 
  001
  Dec 4, 2012 12:12:59 PM
  org.apache.solr.update.processor.DistributedUpdateProcessor
  doDefensiveChecks
  SEVERE: ClusterState says we are the leader, but locally we don't think
 so
 
  003
  Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
  SEVERE: forwarding update to 001:8080/solr/core1/ failed - retrying ...
 
  001
  Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
  SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
  vip/solr/core1. returned non ok status:503, message:Service Unavailable
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  001
  Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
  SEVERE: Error while trying to recover.
  core=core1:org.apache.solr.common.SolrException: We are not the leader
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
  001
  Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
  SEVERE: Error uploading:
 org.apache.solr.client.solrj.SolrServerException:
  IOException occured when talking to server at vip/solr/core1
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  at
 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
  ... 5 lines omitted ...
  at java.lang.Thread.run(Unknown Source)
  Caused by: java.net.SocketException: Connection reset
 
 
  After sometime, all the three servers are going down.
 
  Appreciate, if someone could let us know what we are missing.
 
  Thx,Sudhakar.




Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-05 Thread Mark Miller
It kind of looks like the urls solrcloud is using are not accessible. When you 
go to the admin page and the cloud tab, can you access the urls it shows for 
each shard? That is, if you click one of the links or copy and paste the address 
into a web browser, does it work?

You may have to explicitly set the host= in solr.xml if it's not auto-detecting 
the right one. Make sure the ports look right too.
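
One way to cross-check what the nodes have registered in ZooKeeper - a hedged
SolrJ 4.x sketch (the ZooKeeper connect string is illustrative) that prints the
live_nodes entries, which should match host names and ports the other nodes can
actually reach:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class LiveNodesCheck {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.connect();
            ZkStateReader reader = server.getZkStateReader();
            // Each entry looks like host:port_solr; these are the addresses the
            // cluster uses for forwarding updates and for recovery.
            for (String node : reader.getClusterState().getLiveNodes()) {
                System.out.println("live node: " + node);
            }
            server.shutdown();
        }
    }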

 waitForReplicasToComeUp
 INFO: Waiting until we see more replicas up: total=2 found=1
 timeoutin=17

That happens when you stop the cluster and try to start it again - before a 
leader is chosen, it will wait for all known replicas fora shard to come up so 
that everyone can sync up and have a chance to be the best leader. So at this 
point it was only finding one of 2 known replicas and waiting for the second to 
come up. After a couple minutes (configurable) it will just continue anyway 
without the missing replica (if it doesn't show up).

- Mark

On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni maddineni...@gmail.com wrote:

 Hi,
 We are uploading solr documents to the index in batches using 30 threads
 and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
 1.
 In the code, we are using HttpSolrServer and add(inputDoc) method to add
 docx.
 And, we have the following commit settings in solrconfig:
 
 <autoCommit>
   <maxTime>30</maxTime>
   <maxDocs>1</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>

 <autoSoftCommit>
   <maxTime>1000</maxTime>
 </autoSoftCommit>
 
 Cluster Details:
 
 solr version - 4.0
 zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
 numshards=2 ,
 001, 002, 003 are the solr nodes and these three are behind the
 loadbalancer  vip
 001, 003 assigned to shard1; 002 assigned to shard2
 
 
 Logs:Getting the errors in the below sequence after uploading some docx:
 ---
 003
 Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: Waiting until we see more replicas up: total=2 found=1
 timeoutin=17
 
 001
 Dec 4, 2012 12:12:59 PM
 org.apache.solr.update.processor.DistributedUpdateProcessor
 doDefensiveChecks
 SEVERE: ClusterState says we are the leader, but locally we don't think so
 
 003
 Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
 SEVERE: forwarding update to 001:8080/solr/core1/ failed - retrying ...
 
 001
 Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
 SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
 vip/solr/core1. returned non ok status:503, message:Service Unavailable
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 001
 Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
 SEVERE: Error while trying to recover.
 core=core1:org.apache.solr.common.SolrException: We are not the leader
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
 001
 Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
 SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at vip/solr/core1
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
 ... 5 lines omitted ...
 at java.lang.Thread.run(Unknown Source)
 Caused by: java.net.SocketException: Connection reset
 
 
 After sometime, all the three servers are going down.
 
 Appreciate, if someone could let us know what we are missing.
 
 Thx,Sudhakar.



Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-05 Thread Sudhakar Maddineni
Hey Mark,

Yes, I am able to access all of the nodes under each shard from solrcloud
admin UI.


   - *It kind of looks like the urls solrcloud is using are not accessible.
   When you go to the admin page and the cloud tab, can you access the urls it
   shows for each shard? That is, if you click on of the links or copy and
   paste the address into a web browser, does it work?*

Actually, I got these errors while my document upload task/job was running,
not during a cluster restart. Also, the job ran fine for the first hour and
then started throwing these errors after indexing some docx.

Thx, Sudhakar.




On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller markrmil...@gmail.com wrote:

 It kind of looks like the urls solrcloud is using are not accessible. When
 you go to the admin page and the cloud tab, can you access the urls it
 shows for each shard? That is, if you click on of the links or copy and
 paste the address into a web browser, does it work?

 You may have to explicitly set the host= in solr.xml if it's not auto
 detecting the right one. Make sure the ports like right too.

  waitForReplicasToComeUp
  INFO: Waiting until we see more replicas up: total=2 found=1
  timeoutin=17

 That happens when you stop the cluster and try to start it again - before
 a leader is chosen, it will wait for all known replicas fora shard to come
 up so that everyone can sync up and have a chance to be the best leader. So
 at this point it was only finding one of 2 known replicas and waiting for
 the second to come up. After a couple minutes (configurable) it will just
 continue anyway without the missing replica (if it doesn't show up).

 - Mark

 On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni maddineni...@gmail.com
 wrote:

  Hi,
  We are uploading solr documents to the index in batches using 30 threads
  and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
  1.
  In the code, we are using HttpSolrServer and add(inputDoc) method to add
  docx.
  And, we have the following commit settings in solrconfig:
 
  <autoCommit>
    <maxTime>30</maxTime>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
 
  Cluster Details:
  
  solr version - 4.0
  zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
  numshards=2 ,
  001, 002, 003 are the solr nodes and these three are behind the
  loadbalancer  vip
  001, 003 assigned to shard1; 002 assigned to shard2
 
 
  Logs:Getting the errors in the below sequence after uploading some docx:
 
 ---
  003
  Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
  waitForReplicasToComeUp
  INFO: Waiting until we see more replicas up: total=2 found=1
  timeoutin=17
 
  001
  Dec 4, 2012 12:12:59 PM
  org.apache.solr.update.processor.DistributedUpdateProcessor
  doDefensiveChecks
  SEVERE: ClusterState says we are the leader, but locally we don't think
 so
 
  003
  Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
  SEVERE: forwarding update to 001:8080/solr/core1/ failed - retrying ...
 
  001
  Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
  SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
  vip/solr/core1. returned non ok status:503, message:Service Unavailable
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  001
  Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
  SEVERE: Error while trying to recover.
  core=core1:org.apache.solr.common.SolrException: We are not the leader
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
  001
  Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
  SEVERE: Error uploading:
 org.apache.solr.client.solrj.SolrServerException:
  IOException occured when talking to server at vip/solr/core1
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
  at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  at
 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
  ... 5 lines omitted ...
  at java.lang.Thread.run(Unknown Source)
  Caused by: java.net.SocketException: Connection reset
 
 
  After sometime, all the three servers are going down.
 
  Appreciate, if someone could let us know what we are missing.
 
  Thx,Sudhakar.




Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-05 Thread Mark Miller
That 'waiting' logging should only happen on restart, unless it's some kind of bug.

Beyond that, something is off, but I have no clue why - it seems your 
clusterstate.json is not up to date at all.
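
A small sketch (SolrJ 4.x classes; the ZooKeeper connect string and timeout are
illustrative) for dumping the raw /clusterstate.json straight out of ZooKeeper,
so it can be compared with what each node reports locally:

    import org.apache.solr.common.cloud.SolrZkClient;

    public class DumpClusterState {
        public static void main(String[] args) throws Exception {
            // SolrZkClient is the same helper class seen in the stack traces above.
            SolrZkClient zk = new SolrZkClient("zk1:2181,zk2:2181,zk3:2181", 15000);
            try {
                byte[] data = zk.getData("/clusterstate.json", null, null, true);
                System.out.println(new String(data, "UTF-8"));
            } finally {
                zk.close();
            }
        }
    }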

Have you tried restarting the cluster then? Does that help at all?

Do you see any exceptions around zookeeper session timeouts?

- Mark

On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni maddineni...@gmail.com wrote:

 Hey Mark,
 
 Yes, I am able to access all of the nodes under each shard from solrcloud
 admin UI.
 
 
   - *It kind of looks like the urls solrcloud is using are not accessible.
   When you go to the admin page and the cloud tab, can you access the urls it
   shows for each shard? That is, if you click on of the links or copy and
   paste the address into a web browser, does it work?*
 
 Actually, I got these errors when my document upload task/job was running,
 not during the cluster restart. Also,job ran fine initially for the first
 one hour and started throwing these errors after indexing some docx.
 
 Thx, Sudhakar.
 
 
 
 
 On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller markrmil...@gmail.com wrote:
 
 It kind of looks like the urls solrcloud is using are not accessible. When
 you go to the admin page and the cloud tab, can you access the urls it
 shows for each shard? That is, if you click on of the links or copy and
 paste the address into a web browser, does it work?
 
 You may have to explicitly set the host= in solr.xml if it's not auto
 detecting the right one. Make sure the ports like right too.
 
 waitForReplicasToComeUp
 INFO: Waiting until we see more replicas up: total=2 found=1
 timeoutin=17
 
 That happens when you stop the cluster and try to start it again - before
 a leader is chosen, it will wait for all known replicas fora shard to come
 up so that everyone can sync up and have a chance to be the best leader. So
 at this point it was only finding one of 2 known replicas and waiting for
 the second to come up. After a couple minutes (configurable) it will just
 continue anyway without the missing replica (if it doesn't show up).
 
 - Mark
 
 On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni maddineni...@gmail.com
 wrote:
 
 Hi,
 We are uploading solr documents to the index in batches using 30 threads
 and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
 1.
 In the code, we are using HttpSolrServer and add(inputDoc) method to add
 docx.
 And, we have the following commit settings in solrconfig:
 
<autoCommit>
  <maxTime>30</maxTime>
  <maxDocs>1</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
 
 Cluster Details:
 
 solr version - 4.0
 zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
 numshards=2 ,
 001, 002, 003 are the solr nodes and these three are behind the
 loadbalancer  vip
 001, 003 assigned to shard1; 002 assigned to shard2
 
 
 Logs:Getting the errors in the below sequence after uploading some docx:
 
 ---
 003
 Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
 waitForReplicasToComeUp
 INFO: Waiting until we see more replicas up: total=2 found=1
 timeoutin=17
 
 001
 Dec 4, 2012 12:12:59 PM
 org.apache.solr.update.processor.DistributedUpdateProcessor
 doDefensiveChecks
 SEVERE: ClusterState says we are the leader, but locally we don't think
 so
 
 003
 Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
 SEVERE: forwarding update to 001:8080/solr/core1/ failed - retrying ...
 
 001
 Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
 SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
 vip/solr/core1. returned non ok status:503, message:Service Unavailable
 at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 001
 Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
 SEVERE: Error while trying to recover.
 core=core1:org.apache.solr.common.SolrException: We are not the leader
 at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
 
 001
 Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
 SEVERE: Error uploading:
 org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at vip/solr/core1
 at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
 at
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at
 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
 ... 5 lines omitted ...
 at java.lang.Thread.run(Unknown Source)
 Caused by: 

Re: SolrCloud - ClusterState says we are the leader,but locally ...

2012-12-05 Thread Sudhakar Maddineni
Yep, after restarting, the cluster came back to a normal state. We will run a couple
more tests and see if we can reproduce this issue.

Btw, I am attaching the server logs from before that 'INFO: *Waiting until we
see more replicas*' message. From the logs, we can see that the leader election
process started on 003, which was initially the replica for 001. Does that mean
leader 001 went down at that time?

logs on 003:

12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: My last published State was Active, it's okay to be the
leader.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: I may be the new leader - try and sync
12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=003:8080_solr_core
core=core1.
12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
INFO: Sync replicas to http://003:8080/solr/core1/
12:11:16 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=core1 url=http://003:8080/solr START
replicas=[001:8080/solr/core1/] nUpdates=100
12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes - this message is on 002
12:11:46 PM org.apache.solr.update.PeerSync handleResponse
WARNING: PeerSync: core=core1 url=http://003:8080/solr  exception
talking to 001:8080/solr/core1/, failed
org.apache.solr.client.solrj.SolrServerException: Timeout occured
while waiting response from server at: 001:8080/solr/core1
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
12:11:46 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=core1 url=http://003:8080/solr DONE. sync
failed
12:11:46 PM org.apache.solr.common.SolrException log
SEVERE: Sync Failed
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
rejoinLeaderElection
INFO: There is a better leader candidate than us - going back into
recovery
12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
INFO: Running recovery - first canceling any ongoing recovery
12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process.  core=core1
recoveringAfterStartup=false
12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Attempting to PeerSync from 001:8080/solr/core1/ core=core1
- recoveringAfterStartup=false
12:11:46 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=core1 url=http://003:8080/solr START
replicas=[001:8080/solr/core1/] nUpdates=100
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: *Waiting until we see more replicas up: total=2 found=1
timeoutin=17*
12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: *Waiting until we see more replicas up: total=2 found=1
timeoutin=179495*
12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: *Waiting until we see more replicas up: total=2 found=1
timeoutin=178985*



Thanks for your help.
Sudhakar.

On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller markrmil...@gmail.com wrote:

 The waiting logging had to happen on restart unless it's some kind of bug.

 Beyond that, something is off, but I have no clue why - it seems your
 clusterstate.json is not up to date at all.

 Have you tried restarting the cluster then? Does that help at all?

 Do you see any exceptions around zookeeper session timeouts?

 - Mark

 On Dec