[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784967#action_12784967 ]
Patrick Hunt commented on SOLR-1277:
------------------------------------

bq. Patrick, how low is it feasible to set the timeout? Could it be set low enough that it could be the only input to a failover decision in the case of a very high query load? That is, say a cluster with 3 query slaves is handling 600 queries per second, which means each is getting 200 qps, or one every 5ms on average. If a slave were to fail, queries will start backing up pretty quickly unless a decision is made to drop the failed node within 500ms or so. Clearly, whatever node is distributing the queries to the slaves can mark the failed node as down (say, in the case of a HW load balancer), but could we rely on ZK to handle this for us?

See https://issues.apache.org/jira/browse/ZOOKEEPER-601 for background.

Typically you will have a server ticktime of 2 seconds, so the minimum session timeout the server currently allows is 4 seconds (2 * ticktime). This means the client will send a ping every 4/3 seconds and wait up to 4/3 seconds for a response before it considers the server down. The server, of course, will expire the session after 4 seconds in this case.

It should work (assuming ZOOKEEPER-601 is fixed), but I would not encourage you to go down this road; you can do something better instead (although I don't know enough about Solr, so perhaps this is worse; it may also depend on whether/what HW load balancer you have).

Rather, I would suggest you do something similar to a lease: periodically publish some load information from the query slaves to ZK. Every 250ms each query slave could push an update that says "I am doing X qps currently". If you don't see an update within 500ms, you might consider that slave dead until it comes back (i.e. updates the znode again). If you don't have a HW LB you might even be able to take advantage of this information when routing queries to slaves. Worst case, you could expose this information through a dashboard, giving an operator good insight into the workings of Solr.

Each slave would be doing 4 updates to ZK per second in this case. You are more reliant on having a stable ZK deployment, so keep that in mind (the cluster must be performant, with low GC pauses in ZK itself, i.e. tune the GC properly, etc.). See my ZK service latency review for what to expect re latencies in various situations: http://bit.ly/4ekN8G
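To make the timing above concrete, here is a minimal sketch of creating a client handle with that 4-second minimum session timeout; the connect string and class name are placeholders, not Solr code. The client heartbeats roughly every sessionTimeout/3, which is where the 4/3-second figure comes from.

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class MinSessionTimeoutExample {
    public static void main(String[] args) throws Exception {
        // With a server ticktime of 2000 ms, the server will not grant a session
        // timeout below 2 * ticktime = 4000 ms; requesting less gets bumped up.
        // The client heartbeats roughly every sessionTimeout / 3 (about 1.3 s here).
        ZooKeeper zk = new ZooKeeper("zkhost1:2181,zkhost2:2181,zkhost3:2181", 4000,
            new Watcher() {
                public void process(WatchedEvent event) {
                    // SyncConnected / Disconnected / Expired state changes arrive here.
                }
            });
        System.out.println("negotiated session timeout: " + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}
{code}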
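A rough sketch of the lease-style load publishing with the plain ZooKeeper Java client is below. The /solr/slaves/<name> layout, the QpsSource hook, and the class name are invented for illustration; this is not existing Solr code, just the pattern described above.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Sketch of "publish load info every 250ms, treat >500ms of silence as dead". */
public class SlaveLoadPublisher {

    /** Hypothetical hook reporting the slave's current queries/sec. */
    public interface QpsSource {
        double currentQps();
    }

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** Publisher side, run on each query slave: 4 small writes to ZK per second. */
    public void start(final ZooKeeper zk, final String slaveName, final QpsSource qps) {
        final String path = "/solr/slaves/" + slaveName;  // assumed layout
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                byte[] data = ("qps=" + qps.currentQps()).getBytes(StandardCharsets.UTF_8);
                try {
                    if (zk.exists(path, false) == null) {
                        zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                    } else {
                        zk.setData(path, data, -1);  // -1 = don't check the znode version
                    }
                } catch (KeeperException e) {
                    // Log and retry on the next tick; transient connection loss is expected.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, 0, 250, TimeUnit.MILLISECONDS);
    }

    /**
     * Consumer side (whatever distributes queries, or a dashboard): a slave is
     * considered live only if its znode was modified within the last 500 ms.
     * This compares the ZK server's mtime against the local clock, so it
     * assumes the two are roughly in sync.
     */
    public static boolean isSlaveLive(ZooKeeper zk, String slaveName)
            throws KeeperException, InterruptedException {
        Stat stat = zk.exists("/solr/slaves/" + slaveName, false);
        return stat != null && System.currentTimeMillis() - stat.getMtime() <= 500;
    }
}
{code}

Using a persistent znode rather than an ephemeral one tied to the session is deliberate: liveness is judged from the freshness of the published data, not from ZK session expiration, which is the point of avoiding the aggressive session timeout in the first place.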
> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes where if a server fails, indexing and searching don't stop and all of the partitions remain searchable. For configuration, the ability to centrally deploy a new configuration without servers going offline.
> We can start with basic failover and go from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers). Ideally we'd have a way to detect hot segments and move them seamlessly. With NRT this becomes somewhat more difficult but not impossible?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.