[ 
https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793358#action_12793358
 ] 

Jason Rutherglen commented on SOLR-1277:
----------------------------------------

{quote}Zookeeper gives us the layout of the cluster. It doesn't
seem like we need (yet) fast failure detection from zookeeper -
other nodes can do this synchronously themselves (and would need
to anyway) on things like connection failures. App-level
timeouts should not mark the node as failed since we don't know
how long the request was supposed to take.{quote}

Google Chubby, when used in conjunction with search, sets a high
timeout of 60 seconds, I believe?

Fast failover is difficult, so it'll be best to enable fast
re-requesting to adjacent slave servers on request failure.
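
Something like the following is the kind of client-side
re-request I have in mind. It's only a sketch: the replica list
would come from the ZooKeeper cluster layout, and executeQuery
stands in for the real HTTP call to a core (e.g. via SolrJ); both
are placeholders, not existing APIs.

{code:java}
import java.util.List;

public class ReRequestingClient {

  /** Thrown by the (hypothetical) low-level query helper on any failure. */
  static class ShardRequestException extends Exception {
    ShardRequestException(String msg, Throwable cause) { super(msg, cause); }
  }

  /** Placeholder for the real HTTP request to a single Solr core. */
  String executeQuery(String replicaUrl, String query) throws ShardRequestException {
    throw new ShardRequestException("not implemented in this sketch", null);
  }

  /**
   * Try each replica of a shard in turn; the first success wins,
   * so a single dead or slow server doesn't fail the whole search.
   */
  public String queryShard(List<String> replicaUrls, String query)
      throws ShardRequestException {
    ShardRequestException last = null;
    for (String url : replicaUrls) {
      try {
        return executeQuery(url, query);
      } catch (ShardRequestException e) {
        last = e; // remember the failure, fall through to the next replica
      }
    }
    throw new ShardRequestException("all replicas failed", last);
  }
}
{code}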

Mahadev has some good advice about how we can separate the logic
into different znodes. Going further, I think we'll want to allow
cores to register themselves, then listen to a separate directory
for the state each should be in. We'll need to ensure the
architecture allows for defining multiple tiers (like a pyramid).
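
As a rough sketch of that split, using the plain ZooKeeper client
API: a core registers itself as an ephemeral znode under one tree
and watches a node under a second tree that says what state it
should be in. The paths (/solr_cores, /solr_core_state) and the
state strings are made up for illustration, not a proposal for
the final layout, and the parent znodes are assumed to already
exist.

{code:java}
import java.io.IOException;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CoreRegistrar implements Watcher {

  private final ZooKeeper zk;
  private final String coreName;

  public CoreRegistrar(String zkHosts, String coreName) throws IOException {
    // A generous session timeout (60s, in the spirit of the Chubby comment)
    // so a long GC pause doesn't immediately drop the ephemeral registration.
    this.zk = new ZooKeeper(zkHosts, 60000, this);
    this.coreName = coreName;
  }

  /** Register under a "live cores" tree with an ephemeral node. */
  public void register(String hostPort) throws KeeperException, InterruptedException {
    zk.create("/solr_cores/" + coreName, hostPort.getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    watchDesiredState();
  }

  /** Read (and re-watch) the separate znode that says what this core should do. */
  private void watchDesiredState() throws KeeperException, InterruptedException {
    byte[] state = zk.getData("/solr_core_state/" + coreName, this, null);
    applyState(new String(state)); // e.g. "active", "replicate-from:host:port", "retire"
  }

  private void applyState(String state) {
    System.out.println(coreName + " -> desired state: " + state);
  }

  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        watchDesiredState(); // watches are one-shot, so re-read and re-register
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }
}
{code}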

At http://wiki.apache.org/solr/ZooKeeperIntegration, is a node a
core or a server/CoreContainer?

To move ahead we'll really need to define and settle on the
directory and file structure. I believe the ability to group
cores, so that one may issue a search against a group name
instead of individual shard names, will be useful. The ability to
move cores to different nodes will be necessary, as is the
ability to replicate cores (i.e. have multiple copies available
on different servers).
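
To make that discussion concrete, here's one strawman znode
layout, purely illustrative (none of these paths exist today),
that would cover live-node tracking, group-level search,
balancer-written assignments, and replica counts:

{code}
/solr
  /nodes          (ephemeral: one child per live server/CoreContainer)
    /host1:8983
    /host2:8983
  /cores          (persistent: one child per logical core; data = size, version)
    /core-0017
  /groups         (named groups a search can be issued against)
    /products     (data: core-0001, core-0017, ...)
  /assignments    (desired placement written by the balancer)
    /core-0017    (data: host2:8983, replicas=2)
{code}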

Today I deploy lots of cores from HDFS across quite a few
servers, containing 1.6 billion documents representing at least
2.4 TB of data. I mention this because a lot can potentially go
wrong in this type of setup (i.e. servers going down, corrupted
data, intermittent network, etc). I generate a file that contains
all the information as to which core should go to which Solr
server, using size-based balancing. Ideally I'd be able to
generate a new file, perhaps for load balancing the cores across
new Solr servers or to define that hot cores should be
replicated, and the Solr cluster would move the cores to the
defined servers automatically. This doesn't include the separate
set of servers that handles incremental updates (i.e.
master -> slave).
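
The size-based balancing is roughly the following greedy
assignment (a simplified sketch, not the actual code I run):
biggest cores first, each going to the server that currently
holds the least data.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Greedy size-based balancing: biggest cores first, each to the emptiest server. */
public class CoreBalancer {

  static class Server {
    final String name;
    long assignedBytes = 0;
    final List<String> cores = new ArrayList<String>();
    Server(String name) { this.name = name; }
  }

  /** coreSizes: core name -> size in bytes; serverNames: available Solr servers. */
  public static Map<String, List<String>> balance(Map<String, Long> coreSizes,
                                                  List<String> serverNames) {
    // Servers ordered by how much they already hold (least loaded first).
    PriorityQueue<Server> servers = new PriorityQueue<Server>(
        Math.max(1, serverNames.size()),
        new Comparator<Server>() {
          public int compare(Server a, Server b) {
            return a.assignedBytes < b.assignedBytes ? -1
                 : a.assignedBytes > b.assignedBytes ? 1 : 0;
          }
        });
    for (String name : serverNames) {
      servers.add(new Server(name));
    }

    // Cores ordered largest first so the big ones spread out early.
    List<Map.Entry<String, Long>> cores =
        new ArrayList<Map.Entry<String, Long>>(coreSizes.entrySet());
    Collections.sort(cores, new Comparator<Map.Entry<String, Long>>() {
      public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
        return b.getValue().compareTo(a.getValue());
      }
    });

    for (Map.Entry<String, Long> core : cores) {
      Server target = servers.poll();          // least-loaded server
      target.cores.add(core.getKey());
      target.assignedBytes += core.getValue();
      servers.add(target);                     // re-insert with its new load
    }

    Map<String, List<String>> layout = new TreeMap<String, List<String>>();
    for (Server s : servers) {
      layout.put(s.name, s.cores);
    }
    return layout;
  }
}
{code}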

There's a bit of trepidation in moving forward on this because we
don't want to engineer ourselves into a hole. However, if we need
to change the structure of the znodes in the future, we'll need a
healthy versioning plan such that one may upgrade a cluster while
maintaining backwards compatibility on a live system. Let's think
of a basic plan for this.

In conclusion, let's iterate on the directory structure via the
wiki or this issue?

{quote}A search node can have very large caches tied to readers
that all drop at once on commit, and can require a much larger
heap to accommodate these caches. I think thats a more common
scenario that creates these longer pauses.{quote}

The large cache issue should be fixable with the various NRT
changes in SOLR-1606. Collectively they're not much different
from the per-segment search and sort changes made in Lucene 2.9.

> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and go from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
