Re: SolrCloud failover behavior
Thanks a million, Erick! You're right about killing both nodes hosting the shard. I'll get the wiki corrected.

Nick

On 11/3/2012 10:51 PM, Erick Erickson wrote:
> SolrCloud doesn't work unless every shard has at least one server that is up and running. I _think_ you might be killing both nodes that host one of the shards.
Re: SolrCloud failover behavior
I was right for once <G>... Thanks for updating the Wiki!

Erick

On Tue, Nov 6, 2012 at 9:42 AM, Nick Chase nch...@earthlink.net wrote:
> Thanks a million, Erick! You're right about killing both nodes hosting the shard. I'll get the wiki corrected.
SolrCloud failover behavior
I think there's a change in the behavior of SolrCloud vs. what's in the wiki, but I was hoping someone could confirm for me. I checked JIRA and there were a couple of issues requesting partial results if one server comes down, but that doesn't seem to be the issue here. I also checked CHANGES.txt and don't see anything that seems to apply.

I'm running "Example B: Simple two shard cluster with shard replicas" from the wiki at https://wiki.apache.org/solr/SolrCloud and everything starts out as expected. However, when I get to the part about failover behavior, things get a little wonky.

I added data to the shard running on 7475. If I kill 7500, a query to any of the other servers works fine. But if I kill 7475, rather than getting zero results on a search to 8983 or 8900, I get a 503 error:

    <response>
      <lst name="responseHeader">
        <int name="status">503</int>
        <int name="QTime">5</int>
        <lst name="params">
          <str name="q">*:*</str>
        </lst>
      </lst>
      <lst name="error">
        <str name="msg">no servers hosting shard:</str>
        <int name="code">503</int>
      </lst>
    </response>

I don't see any errors in the consoles.

Also, if I kill 8983, which includes the Zookeeper server, everything dies, rather than just staying in a steady state; the other servers continually show:

    Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
    INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
    Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
    WARNING: Session 0x13ac6cf87890002 for server null, unexpected error, closing socket connection and attempting reconnect
    java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)
    Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect

over and over again, and a call to any of the servers shows a connection error to 8983.
This is the current 4.0.0 release, running on Windows 7. If this is the proper behavior and the wiki needs updating, fine; I just need to know. Otherwise if anybody has any clues as to what I may be missing, I'd be grateful. :) Thanks... --- Nick
Re: SolrCloud failover behavior
SolrCloud doesn't work unless every shard has at least one server that is up and running. I _think_ you might be killing both nodes that host one of the shards. The admin page has a link showing you the state of your cluster. So when this happens, does that page show both nodes for that shard being down?

And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one ZK node, killing that will bring down the whole cluster. Which is why the usual recommendation is that ZK be run externally and usually an odd number of ZK nodes (three or more).

Anyone can create a login and edit the Wiki, so any clarifications are welcome!

Best
Erick

On Sat, Nov 3, 2012 at 12:17 PM, Nick Chase nch...@earthlink.net wrote:
> I think there's a change in the behavior of SolrCloud vs. what's in the wiki, but I was hoping someone could confirm for me.
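
For anyone who wants to reason through the two rules in this thread (every shard needs at least one live node, and ZooKeeper needs a majority of its ensemble up), here is a rough Python sketch. It does not talk to Solr or ZooKeeper at all; the function names are made up for illustration, and the shard-to-port layout is an assumption based on the ports mentioned in the thread, not something read from a real cluster state.

```python
def shards_without_live_replica(shard_to_nodes, down_nodes):
    """Return the shards whose hosting nodes are all down (sorted by name)."""
    down = set(down_nodes)
    return sorted(
        shard
        for shard, nodes in shard_to_nodes.items()
        if all(node in down for node in nodes)
    )


def zk_quorum_size(ensemble_size):
    """Smallest strict majority of a ZooKeeper ensemble of the given size."""
    return ensemble_size // 2 + 1


def zk_has_quorum(ensemble_size, nodes_down):
    """True if the surviving ZooKeeper nodes still form a majority."""
    return ensemble_size - nodes_down >= zk_quorum_size(ensemble_size)


# ASSUMED layout: the ports come from the thread, but which port hosts
# which shard is a guess made only for this illustration.
layout = {"shard1": [8983, 8900], "shard2": [7475, 7500]}

# Killing one node of a shard leaves every shard served.
print(shards_without_live_replica(layout, [7500]))
# Killing both nodes of the same shard leaves that shard with no server.
print(shards_without_live_replica(layout, [7500, 7475]))
# A single ZooKeeper node cannot lose anything; three nodes tolerate one loss.
print(zk_has_quorum(1, 1), zk_has_quorum(3, 1), zk_has_quorum(3, 2))
```

Under that assumed layout, killing 7500 and then 7475 takes out the only two nodes for one shard, which matches the "no servers hosting shard" error, and a one-node ZooKeeper ensemble has no room for any failure.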