[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791474#action_12791474 ]

Mark Miller commented on SOLR-1277:
-----------------------------------

bq. Not sure I understand... for group membership, I had assumed there would be 
an ephemeral znode per node. Zookeeper does pings, and deletes the znode when 
the session expires, but those aren't "updates" per se.

Right - that's the problem I want to address. Ephemeral nodes go away when the 
client times out - with a low timeout, you can learn relatively fast that a 
node is down. But because we may have long GC pauses, a low timeout will cause 
false "down" reports, and we have to handle reconnections. If we raise the 
timeout to get around these GC pauses, then when there really is a problem, it 
will take a long time to learn about it. One of the recommendations above was 
to use a lease system instead, where each node does these updates. I'm trying 
to determine which strategy we actually want to use. Another option given was 
to let the GC cause a timeout and then reconnect - but Solr has to "wait" for 
the reconnection to occur before it can access ZooKeeper again.
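
To make the lease idea a bit more concrete, here's a rough sketch - the path 
layout, class name, and lease window are placeholders, not anything in the 
patch. Each node touches a persistent znode on a fixed interval, and anyone 
building a shard list checks the znode's mtime against the lease window 
instead of relying on an ephemeral node vanishing:

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Lease-style liveness sketch - names, paths, and timings are hypothetical.
public class NodeLease {
  private static final long LEASE_MS = 30000; // lease window, tunable

  private final ZooKeeper zk;
  private final String path; // e.g. /solr/nodes/host:port

  public NodeLease(ZooKeeper zk, String path) {
    this.zk = zk;
    this.path = path;
  }

  // Called once at startup: create a *persistent* znode instead of an ephemeral one.
  public void register() throws KeeperException, InterruptedException {
    if (zk.exists(path, false) == null) {
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
  }

  // Called on a timer well inside LEASE_MS: bump the znode's mtime.
  public void renew() throws KeeperException, InterruptedException {
    zk.setData(path, new byte[0], -1);
  }

  // Anyone building a shard list asks whether the lease is still fresh
  // (ignoring clock skew between the ZK server and this client for the sketch).
  public boolean isAlive() throws KeeperException, InterruptedException {
    Stat stat = zk.exists(path, false);
    return stat != null && System.currentTimeMillis() - stat.getMtime() < LEASE_MS;
  }
}
{code}

The trade-off versus an ephemeral node is that the detection delay becomes the 
lease window rather than the session timeout, and a long GC pause only looks 
like a missed renewal instead of a session expiration that tears the node out 
of the cluster.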

{quote}
Zookeeper client->server timeouts? Or Solr node->node request timeouts?
Zookeeper timeouts need to be handled on a per-case basis - we should design 
such that most of the time we can continue operating even if we can't talk to 
zookeeper.
{quote}

Zookeeper client->server timeouts

But as you say above, if a client times out, its ephemeral node goes away, and 
that shard will presumably no longer participate in distrib requests hitting 
other servers. How can we continue operating? We won't know which shards to 
hit (I guess we could use the "old" shards list?), and we won't be part of 
distributed requests from other shards, because our ephemeral node will be 
removed ...
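
One way to keep serving through a ZooKeeper outage is the "old" shards list 
idea - a sketch below, where the /solr/nodes layout and class name are made up: 
cache the last shard list we successfully read, and fall back to it whenever 
ZooKeeper can't be reached:

{code:java}
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch: fall back to the last known shard list when ZK is unreachable.
public class ShardListCache {
  private final ZooKeeper zk;
  private volatile List<String> lastKnownShards = Collections.emptyList();

  public ShardListCache(ZooKeeper zk) {
    this.zk = zk;
  }

  public List<String> getShards() throws InterruptedException {
    try {
      // Hypothetical layout: one child znode per live node under /solr/nodes.
      List<String> shards = zk.getChildren("/solr/nodes", false);
      lastKnownShards = shards; // remember the last successful read
      return shards;
    } catch (KeeperException e) {
      // Connection loss, session expiration, etc. - serve with the "old"
      // shards list rather than failing the request.
      return lastKnownShards;
    }
  }
}
{code}

That only covers the "which shards do I hit" half, though - it doesn't help 
with the fact that other nodes stop sending us requests once our ephemeral 
node disappears.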

I'm referring to Patrick Hunt's comments above. Perhaps, because recovery won't 
be expensive, that's what we want to do - but Solr won't be able to access 
ZooKeeper until it has recovered - so I guess for that brief period, we drop 
out of other distrib requests, and if we do get hit, we just use the old shards 
list for requests that reach the dropped server?
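
For the "let the GC cause a timeout, then reconnect" route, the recovery step 
would look roughly like this - again just a sketch, with the connect string, 
session timeout, and node path as placeholders: watch for session expiration, 
build a fresh client, and re-create the ephemeral node before letting Solr 
talk to ZooKeeper again:

{code:java}
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: recover from a session expiration by reconnecting and re-registering.
public class ReconnectingWatcher implements Watcher {
  private final String connectString; // e.g. "zkhost:2181"
  private final String nodePath;      // e.g. "/solr/nodes/host:port"
  private volatile ZooKeeper zk;

  public ReconnectingWatcher(String connectString, String nodePath) throws Exception {
    this.connectString = connectString;
    this.nodePath = nodePath;
    connect();
  }

  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.Expired) {
      // Session is gone (e.g. after a long GC pause): the ephemeral node was
      // deleted server-side, so we must reconnect and re-create it.
      try {
        connect();
      } catch (Exception e) {
        // Real code would retry with backoff; for the sketch just log.
        e.printStackTrace();
      }
    }
  }

  private void connect() throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    zk = new ZooKeeper(connectString, 30000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
        // Forward everything to the outer watcher as well.
        ReconnectingWatcher.this.process(event);
      }
    });
    // This is the "wait" - Solr can't use ZooKeeper until we're connected again.
    connected.await();
    zk.create(nodePath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }
}
{code}

During that window between expiration and re-registration we'd be invisible to 
the other nodes, which is the brief drop-out of distrib requests described 
above.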

> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and go from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
