[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784967#action_12784967 ]
Patrick Hunt commented on SOLR-1277:
------------------------------------

bq. Patrick, how low is it feasible to set the timeout? Could it be set low enough that it could be the only input to a failover decision in the case of a very high query load? That is, say a cluster with 3 query slaves is handling 600 queries per second, which means each is getting 200 qps, or one every 5ms on average. If a slave were to fail, queries will start backing up pretty quickly unless a decision is made to drop the failed node within 500ms or so. Clearly, whatever node is distributing the queries to the slaves can mark the failed node as down (say, in the case of a HW load balancer), but could we rely on ZK to handle this for us?

See https://issues.apache.org/jira/browse/ZOOKEEPER-601 for background.

Typically you will have a server ticktime of 2 seconds, so the minimum session timeout the server currently allows is 4 seconds (2 * ticktime). This means the client will send a ping every 4/3 seconds and wait up to 4/3 seconds for a response before it considers the server down. The server, of course, will expire the session after 4 seconds in this case.

It should work (assuming ZOOKEEPER-601 is fixed), but I would not encourage you to go down this road; you can do something better instead (although I don't know enough about Solr, so perhaps this is worse; it may also depend on whether/what HW load balancer you have).

Rather, I would suggest you do something similar to a lease: periodically publish some load information from the query slaves to ZK. Every 250ms each query slave could push an update that says "I am doing X qps currently". If you don't see an update within 500ms, you might consider that slave dead until it comes back (i.e. updates the znode again). If you don't have a HW LB you might even be able to take advantage of this information when routing queries to slaves. Worst case, you could expose this information through a dashboard, giving an operator good insight into the workings of Solr.

Each slave would be doing 4 updates to ZK per second in this case. You are more reliant on having a stable ZK deployment, so keep that in mind (the cluster must be performant, with low GC pauses in ZK itself, i.e. tune the GC properly, etc.). See my ZK service latency review for what to expect re latencies in various situations: http://bit.ly/4ekN8G
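To make the timing above concrete, here is a minimal sketch of creating a client handle with that 4-second minimum session timeout; the connect string and class name are placeholders, not Solr code. The client heartbeats roughly every sessionTimeout/3, which is where the 4/3-second figure comes from.

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class MinSessionTimeoutExample {
    public static void main(String[] args) throws Exception {
        // With a server ticktime of 2000 ms, the server will not grant a session
        // timeout below 2 * ticktime = 4000 ms; requesting less gets bumped up.
        // The client heartbeats roughly every sessionTimeout / 3 (about 1.3 s here).
        ZooKeeper zk = new ZooKeeper("zkhost1:2181,zkhost2:2181,zkhost3:2181", 4000,
            new Watcher() {
                public void process(WatchedEvent event) {
                    // SyncConnected / Disconnected / Expired state changes arrive here.
                }
            });
        System.out.println("negotiated session timeout: " + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}
{code}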
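A rough sketch of the lease-style load publishing with the plain ZooKeeper Java client is below. The /solr/slaves/<name> layout, the QpsSource hook, and the class name are invented for illustration; this is not existing Solr code, just the pattern described above.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Sketch of "publish load info every 250ms, treat >500ms of silence as dead". */
public class SlaveLoadPublisher {

    /** Hypothetical hook reporting the slave's current queries/sec. */
    public interface QpsSource {
        double currentQps();
    }

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** Publisher side, run on each query slave: 4 small writes to ZK per second. */
    public void start(final ZooKeeper zk, final String slaveName, final QpsSource qps) {
        final String path = "/solr/slaves/" + slaveName;  // assumed layout
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                byte[] data = ("qps=" + qps.currentQps()).getBytes(StandardCharsets.UTF_8);
                try {
                    if (zk.exists(path, false) == null) {
                        zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                    } else {
                        zk.setData(path, data, -1);  // -1 = don't check the znode version
                    }
                } catch (KeeperException e) {
                    // Log and retry on the next tick; transient connection loss is expected.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, 0, 250, TimeUnit.MILLISECONDS);
    }

    /**
     * Consumer side (whatever distributes queries, or a dashboard): a slave is
     * considered live only if its znode was modified within the last 500 ms.
     * This compares the ZK server's mtime against the local clock, so it
     * assumes the two are roughly in sync.
     */
    public static boolean isSlaveLive(ZooKeeper zk, String slaveName)
            throws KeeperException, InterruptedException {
        Stat stat = zk.exists("/solr/slaves/" + slaveName, false);
        return stat != null && System.currentTimeMillis() - stat.getMtime() <= 500;
    }
}
{code}

Using a persistent znode rather than an ephemeral one tied to the session is deliberate: liveness is judged from the freshness of the published data, not from ZK session expiration, which is the point of avoiding the aggressive session timeout in the first place.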
> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes where if a server fails, indexing and searching don't stop and all of the partitions remain searchable. For configuration, the ability to centrally deploy a new configuration without servers going offline.
> We can start with basic failover and go from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers). Ideally we'd have a way to detect hot segments and move them seamlessly. With NRT this becomes somewhat more difficult but not impossible?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.