Thanks a million, Erick! You're right about killing both nodes hosting the shard. I'll get the wiki corrected.

----  Nick

On 11/3/2012 10:51 PM, Erick Erickson wrote:
SolrCloud doesn't work unless every shard has at least one server that is
up and running.

I _think_ you might be killing both nodes that host one of the shards. The
admin page has a link showing you the state of your cluster. So when this
happens, does that page show both nodes for that shard being down?
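
The rule above (every shard needs at least one live server) can be sketched as a small check. This is only an illustration: the shard-to-port layout below is an assumption inferred from the ports mentioned later in this thread, not something read from a real cluster, and `unserved_shards` is a hypothetical helper, not a Solr API.

```python
def unserved_shards(shard_replicas, live_nodes):
    """Return the names of shards that have no live replica."""
    return sorted(
        shard
        for shard, replicas in shard_replicas.items()
        if not any(node in live_nodes for node in replicas)
    )


# Assumed layout (hypothetical): two shards, two nodes each, keyed by port.
layout = {"shard1": {8983, 8900}, "shard2": {7475, 7500}}
all_nodes = {8983, 8900, 7475, 7500}

# One replica of shard2 is still up, so every shard can be served.
print(unserved_shards(layout, all_nodes - {7500}))        # []

# Both nodes hosting shard2 are down, so queries cannot be answered.
print(unserved_shards(layout, all_nodes - {7500, 7475}))  # ['shard2']
```

With that assumed layout, killing 7500 and then 7475 leaves shard2 with no server, which matches the "no servers hosting shard" error.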

And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one ZK
node, killing that will bring down the whole cluster. Which is why the usual
recommendation is that ZK be run externally, usually with an odd number of
ZK nodes (three or more).
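
The quorum arithmetic can be sketched as follows, assuming the usual majority rule (more than half of the ensemble must be up). The function names are made up for illustration and are not part of ZooKeeper or Solr.

```python
def quorum_size(ensemble_size):
    """Smallest number of ZK nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many ZK nodes can go down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(f"{n} ZK node(s): quorum={quorum_size(n)}, "
          f"can lose {tolerated_failures(n)}")
```

A single node tolerates no failures, and going from three nodes to four does not buy an extra tolerated failure, which is why odd ensemble sizes are the usual choice.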

Anyone can create a login and edit the Wiki, so any clarifications are
welcome!

Best
Erick


On Sat, Nov 3, 2012 at 12:17 PM, Nick Chase <nch...@earthlink.net> wrote:

I think there's a change in the behavior of SolrCloud vs. what's in the
wiki, but I was hoping someone could confirm for me.  I checked JIRA and
there were a couple of issues requesting partial results if one server
comes down, but that doesn't seem to be the issue here.  I also checked
CHANGES.txt and don't see anything that seems to apply.

I'm running "Example B: Simple two shard cluster with shard replicas" from
the wiki at https://wiki.apache.org/solr/SolrCloud and everything starts out
as expected.  However, when I get to the part about failover behavior,
things get a little wonky.

I added data to the shard running on 7475.  If I kill 7500, a query to any
of the other servers works fine.  But if I kill 7475, rather than getting
zero results on a search to 8983 or 8900, I get a 503 error:

<response>
    <lst name="responseHeader">
       <int name="status">503</int>
       <int name="QTime">5</int>
       <lst name="params">
          <str name="q">*:*</str>
       </lst>
    </lst>
    <lst name="error">
       <str name="msg">no servers hosting shard:</str>
       <int name="code">503</int>
    </lst>
</response>
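
For anyone scripting this failover test, a response like the one above can be checked programmatically. The following is a minimal sketch using only the Python standard library; `solr_error` is a hypothetical helper name, and it assumes the response has the same XML layout as shown here.

```python
import xml.etree.ElementTree as ET


def solr_error(xml_text):
    """Return (status, message) for an error response, or None if status is 0."""
    root = ET.fromstring(xml_text)
    status_el = root.find("./lst[@name='responseHeader']/int[@name='status']")
    status = int(status_el.text) if status_el is not None else 0
    if status == 0:
        return None
    msg_el = root.find("./lst[@name='error']/str[@name='msg']")
    message = msg_el.text if msg_el is not None else ""
    return status, message


sample = """<response>
  <lst name="responseHeader">
    <int name="status">503</int>
    <int name="QTime">5</int>
  </lst>
  <lst name="error">
    <str name="msg">no servers hosting shard:</str>
    <int name="code">503</int>
  </lst>
</response>"""

print(solr_error(sample))  # (503, 'no servers hosting shard:')
```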

I don't see any errors in the consoles.

Also, if I kill 8983, which includes the Zookeeper server, everything
dies, rather than just staying in a steady state; the other servers
continually show:

Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
WARNING: Session 0x13ac6cf87890002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)

Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect

over and over again, and a call to any of the servers shows a connection
error to 8983.

This is the current 4.0.0 release, running on Windows 7.

If this is the proper behavior and the wiki needs updating, fine; I just
need to know.  Otherwise if anybody has any clues as to what I may be
missing, I'd be grateful. :)

Thanks...

---  Nick

