Re: SolrCloud failover behavior

2012-11-06 Thread Nick Chase
Thanks a million, Erick!  You're right about killing both nodes hosting 
the shard.  I'll get the wiki corrected.


  Nick


Re: SolrCloud failover behavior

2012-11-06 Thread Erick Erickson
I was right for once <G>...

Thanks for updating the Wiki!

Erick



SolrCloud failover behavior

2012-11-03 Thread Nick Chase
I think there's a change in the behavior of SolrCloud vs. what's in the 
wiki, but I was hoping someone could confirm for me.  I checked JIRA and 
there were a couple of issues requesting partial results if one server 
comes down, but that doesn't seem to be the issue here.  I also checked 
CHANGES.txt and don't see anything that seems to apply.


I'm running "Example B: Simple two shard cluster with shard replicas" 
from the wiki at https://wiki.apache.org/solr/SolrCloud and everything 
starts out as expected.  However, things get a little wonky when I get 
to the part about failover behavior.


I added data to the shard running on 7475.  If I kill 7500, a query to 
any of the other servers works fine.  But if I kill 7475, rather than 
getting zero results on a search to 8983 or 8900, I get a 503 error:


<response>
   <lst name="responseHeader">
      <int name="status">503</int>
      <int name="QTime">5</int>
      <lst name="params">
         <str name="q">*:*</str>
      </lst>
   </lst>
   <lst name="error">
      <str name="msg">no servers hosting shard:</str>
      <int name="code">503</int>
   </lst>
</response>
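
For what it's worth, a client can detect this condition by inspecting the error payload rather than just the HTTP status. A minimal sketch using only the standard library, fed the response body shown above (the element layout follows Solr's XML response writer, where values are typed elements keyed by a "name" attribute):

```python
import xml.etree.ElementTree as ET

# The 503 body Solr returns when no replica of a shard is live.
body = """<response>
<lst name="responseHeader">
  <int name="status">503</int>
  <int name="QTime">5</int>
</lst>
<lst name="error">
  <str name="msg">no servers hosting shard:</str>
  <int name="code">503</int>
</lst>
</response>"""

root = ET.fromstring(body)
# ElementTree's limited XPath supports [@name='...'] predicates,
# which is enough to pull the error code and message out.
code = int(root.find("./lst[@name='error']/int[@name='code']").text)
msg = root.find("./lst[@name='error']/str[@name='msg']").text
print(code, msg)  # 503 no servers hosting shard:
```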

I don't see any errors in the consoles.

Also, if I kill 8983, which includes the Zookeeper server, everything 
dies, rather than just staying in a steady state; the other servers 
continually show:


Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread 
startConnect
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
WARNING: Session 0x13ac6cf87890002 for server null, unexpected error, 
closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
   at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)

Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread 
startConnect


over and over again, and a call to any of the servers shows a connection 
error to 8983.
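
That reconnect loop is just the ZK client retrying its TCP connect to the embedded ZooKeeper. You can reproduce the same check with a plain socket probe; `port_is_open` is a hypothetical helper, and 9983 is the embedded ZK port from the wiki example:

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connect to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the JVM on 8983 (and its embedded ZooKeeper) killed, this probe
# fails the same way the ConnectException in the log does.
print(port_is_open("localhost", 9983))
```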


This is the current 4.0.0 release, running on Windows 7.

If this is the proper behavior and the wiki needs updating, fine; I just 
need to know.  Otherwise if anybody has any clues as to what I may be 
missing, I'd be grateful. :)


Thanks...

---  Nick


Re: SolrCloud failover behavior

2012-11-03 Thread Erick Erickson
SolrCloud doesn't work unless every shard has at least one server that is
up and running.

I _think_ you might be killing both nodes that host one of the shards. The
admin page has a link showing you the state of your cluster. So when this
happens, does that page show both nodes for that shard being down?

And yeah, SolrCloud requires a quorum of ZK nodes to be up. So with only
one ZK node, killing it will bring down the whole cluster. That's why the
usual recommendation is to run ZK externally, with an odd number of ZK
nodes (three or more).
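
The quorum arithmetic behind that recommendation is simple: an ensemble of n nodes needs a strict majority alive, so it tolerates floor((n - 1) / 2) failures. A quick sketch (plain arithmetic, not a ZooKeeper API):

```python
def zk_fault_tolerance(n):
    """Nodes that can fail while an ensemble of n keeps quorum."""
    quorum = n // 2 + 1  # strict majority required
    return n - quorum

# A single ZK node tolerates zero failures -- killing it downs the
# whole cluster, which is the behavior reported in this thread.
# Three nodes tolerate one failure; five tolerate two.  Even counts
# buy nothing: four nodes still tolerate only one failure.
for n in (1, 3, 4, 5):
    print(n, zk_fault_tolerance(n))
```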

Anyone can create a login and edit the Wiki, so any clarifications are
welcome!

Best
Erick

