Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "SolrCloud" page has been changed by jayqhacker: http://wiki.apache.org/solr/SolrCloud?action=diff&rev1=77&rev2=78

Comment: Mention that the zkRun localhost default does not work for a distributed ensemble.

## page was renamed from SolrCloud2
{{http://people.apache.org/~markrmiller/2shard4serverFull.jpg}}

<<TableOfContents()>>

== SolrCloud ==
SolrCloud is the name of a set of new distributed capabilities in Solr. Passing parameters to enable these capabilities will let you set up a highly available, fault-tolerant cluster of Solr servers. Use SolrCloud when you want high-scale, fault-tolerant, distributed indexing and search capabilities. Look at the following 'Getting Started' section to quickly learn how to start up a cluster. There are three quick examples to follow, each showing how to start up a progressively more complicated cluster. After checking out the examples, consult the following sections for more detailed information as needed.

@@ -81, +78 @@

This example simply builds on the previous example by creating another copy of shard1 and shard2. Extra shard copies can be used for high availability and fault tolerance, or simply for increasing the query capacity of the cluster.

First, run through the previous example so we already have two shards and some documents indexed into each. Then simply make a copy of those two servers:

{{{
@@ -150, +146 @@
}}}

Now since we are running three embedded zookeeper servers as an ensemble, everything can keep working even if a server is lost. To demonstrate this, kill the exampleB server by pressing CTRL+C in its window and then browse to the [[http://localhost:8983/solr/#/~cloud|Solr Zookeeper Admin UI]] to verify that the zookeeper service still works.

Note that when running on multiple hosts, you will need to set `-DzkRun=hostname:port` on each host to the exact name and port used in `-DzkHost` -- the default `localhost` will not work.
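As a sketch of what that means in practice, assume three hosts named solr1, solr2, and solr3 (hypothetical names), each running Solr on port 8983 with an embedded ZooKeeper on port 9983. Each host would then be started along these lines:

{{{
# on solr1 (repeat on solr2 and solr3, changing only the
# -DzkRun value to name that host)
java -DzkRun=solr1:9983 \
     -DzkHost=solr1:9983,solr2:9983,solr3:9983 \
     -DnumShards=2 -jar start.jar
}}}

The value given to -DzkRun must appear verbatim in the -DzkHost list; that is how the node identifies itself within the ensemble.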
== ZooKeeper ==
Multiple Zookeeper servers running together for fault tolerance and high availability are called an ensemble. For production, it's recommended that you run an external zookeeper ensemble rather than having Solr run embedded servers. See the [[http://zookeeper.apache.org/|Apache ZooKeeper]] site for more information on downloading and running a zookeeper ensemble. More specifically, try [[http://zookeeper.apache.org/doc/r3.3.4/zookeeperStarted.html|Getting Started]] and [[http://zookeeper.apache.org/doc/r3.3.4/zookeeperAdmin.html|ZooKeeper Admin]]. It's actually pretty simple to get going. You can stick with having Solr run ZooKeeper, but keep in mind that a ZooKeeper cluster is not easily changed dynamically. Until further support is added to ZooKeeper, changes are best done with rolling restarts. Handling this in a separate process from Solr will usually be preferable. When Solr runs an embedded zookeeper server, it defaults to using the solr port plus 1000 for the zookeeper client port.
In addition, it defaults to adding one to the client port for the zookeeper server port, and two for the zookeeper leader election port. So in the first example with Solr running at 8983, the embedded zookeeper server used port 9983 for the client port and 9984,9985 for the server ports.

In terms of trying to make sure ZooKeeper is set up to be very fast, keep a few things in mind: Solr does not use ZooKeeper intensively - optimizations may not be necessary in many cases. Also, while adding more ZooKeeper nodes will help some with read performance, it will slightly hurt write performance. Again, Solr does not really do much with ZooKeeper when your cluster is in a steady state. If you do need to optimize ZooKeeper, here are a few helpful notes:

 1. ZooKeeper works best when it has a dedicated machine. ZooKeeper is a timely service and a dedicated machine helps ensure timely responses. A dedicated machine is not required, however.
 1. ZooKeeper works best when you put its transaction log and snapshots on different disk drives.
 1. If you do colocate ZooKeeper with Solr, using separate disk drives for Solr and ZooKeeper will help with performance.

== Managing collections via the Collections API ==
The collections API lets you manage collections. Under the hood, it generally uses the CoreAdmin API to manage SolrCores on each server - it's essentially sugar for actions that you could handle yourself if you made individual CoreAdmin API calls to each server you wanted an action to take place on.
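Since the collection actions are plain HTTP calls, any HTTP client will do; for example, with curl against one node of the cluster (the collection name and shard/replica counts here are illustrative):

{{{
# create a collection named mycollection with 3 shards, one extra replica per shard
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=1'

# reload it after a config change, or delete it when no longer needed
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'
curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection'
}}}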
Create http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4

Note: replicationFactor defines the maximum number of replicas created in addition to the leader from amongst the nodes currently running (i.e. nodes added later will not be used for this collection). Imagine you have a cluster with 20 nodes and want to add an additional smaller collection to your installation with 2 shards, each shard with a leader and two replicas. You would specify replicationFactor=2. Now six of your nodes will host this new collection and the other 14 will not host it.

Delete http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection

Reload http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection

== Creating cores via CoreAdmin ==
New Solr cores may also be created and associated with a collection via CoreAdmin.

@@ -222, +216 @@

{{{
http://localhost:8983/solr/collection1/select?collection=collection1_NY,collection1_NJ,collection1_CT
}}}

== Required Config ==
All of the required config is already set up in the example configs shipped with Solr.
The following is what you need to add if you are migrating old config files, or what you should not remove if you are starting with new config files.

=== schema.xml ===
You must have a _version_ field defined:

{{{
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
}}}

=== solrconfig.xml ===
You must have an UpdateLog defined - this should be defined in the updateHandler section.

{{{
<!-- Enables a transaction log, currently used for real-time get.
     "dir" - the target directory for transaction logs, defaults to the
     solr data directory. -->
<updateLog>
  <str name="dir">${solr.data.dir:}</str>
</updateLog>
}}}

You must have a replication handler called /replication defined:

{{{
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
}}}

You must have a realtime get handler called /get defined:

{{{
<requestHandler name="/get" class="solr.RealTimeGetHandler">
  <lst name="defaults">
@@ -261, +250 @@
</requestHandler>
}}}

You must have the admin handlers defined:

{{{
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
}}}

And you must leave the admin path in solr.xml as the default:

{{{
<cores adminPath="/admin/cores"
}}}

@@ -278, +269 @@
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
}}}

If you do not want the '''DistributedUpdateProcessorFactory''' auto injected into your chain (say you want to use SolrCloud functionality, but you want to distribute updates yourself) then specify the following update processor factory in your chain: '''NoOpDistributingUpdateProcessorFactory'''

== Re-sizing a Cluster ==

@@ -296, +286 @@

If you want to use the Near Realtime search support, you will probably want to enable auto soft commits in your solrconfig.xml file before putting it into zookeeper.
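A minimal sketch of such a setting in the updateHandler section of solrconfig.xml (the 1000 ms interval is purely illustrative; tune it to your freshness needs):

{{{
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- open a new searcher over uncommitted documents at most once per second -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
}}}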
Otherwise you can send explicit soft commits to the cluster as you desire. See NearRealtimeSearch.

== Parameter Reference ==

=== Cluster Params ===
||numShards ||Defaults to 1 ||The number of shards to hash documents to. There will be one leader per shard and each leader can have N replicas. ||

=== SolrCloud Instance Params ===
These are set in solr.xml, but by default they are set up in solr.xml to also work with system properties.

||host ||Defaults to the first local host address found ||If the wrong host address is found automatically, you can override the host address with this param. ||
||hostPort ||Defaults to the jetty.port system property ||The port that Solr is running on - by default this is found by looking at the jetty.port system property. ||
||hostContext ||Defaults to solr ||The context path for the Solr webapp. ||

=== SolrCloud Instance ZooKeeper Params ===
||zkRun ||Defaults to localhost:<solrPort+1001> ||Causes Solr to run an embedded version of ZooKeeper. Set to the address of ZooKeeper on this node - this allows us to know who 'we are' in the list of addresses in the `zkHost` connect string. Simply using `-DzkRun` gets you the default value. Note this must be one of the exact strings from `zkHost`; in particular, the default `localhost` will not work for a multi-machine ensemble. ||
||zkHost ||No default ||The host address for ZooKeeper - usually this should be a comma separated list of addresses to each node in your ZooKeeper ensemble. ||
||zkClientTimeout ||Defaults to 15000 ||The time a client is allowed to not talk to ZooKeeper before having its session expired. ||

zkRun and zkHost are set using system properties. zkClientTimeout is set in solr.xml, but by default can also be set using a system property.

=== SolrCloud Core Params ===
||shardId ||Defaults to being automatically assigned based on numShards ||Allows you to specify the id used to group SolrCores into shards. ||

shardId can be configured in solr.xml for each core element as an attribute.

== Getting your Configuration Files into ZooKeeper ==

=== Config Startup Bootstrap Params ===
There are two different ways you can use system properties to upload your initial configuration files to ZooKeeper the first time you start Solr. Remember that these are meant to be used only on first startup or when overwriting configuration files - every time you start Solr with these system properties, any current configuration files in ZooKeeper may be overwritten when 'conf set' names match.

 1. Look at solr.xml and upload the conf for each SolrCore found. The 'config set' name will be the collection name for that SolrCore, and collections will use the 'config set' that has a matching name.

||bootstrap_conf ||No default ||If you pass -Dbootstrap_conf=true on startup, each SolrCore you have configured will have its configuration files automatically uploaded and linked to the collection that SolrCore is part of. ||

 2. Upload the given directory as a 'conf set' with the given name. No linking of collection to 'config set' is done. However, if only one 'conf set' exists, a collection will auto link to it.

||bootstrap_confdir ||No default ||If you pass -Dbootstrap_confdir=<directory> on startup, that specific directory of configuration files will be uploaded to ZooKeeper with a 'conf set' name defined by the system property below, collection.configName. ||
||collection.configName ||Defaults to configuration1 ||Determines the name of the conf set pointed to by bootstrap_confdir. ||

=== Command Line Util ===
The CLI tool also lets you upload config to ZooKeeper. It allows you to do it the same two ways that you can above. It also provides a few other commands that let you link collection sets to collections, make ZooKeeper paths or clear them, as well as download configs from ZooKeeper to the local filesystem.
@@ -348, +346 @@
  -s,--solrhome <arg>   for bootstrap, runzk: solrhome location
  -z,--zkhost <arg>     ZooKeeper host address
}}}

==== Examples ====
{{{
# try uploading a conf dir
@@ -364, +361 @@
}}}

==== Scripts ====
There are scripts in example/cloud-scripts that handle the classpath and class name for you if you are using Solr out of the box with Jetty. Commands then become:

{{{
sh zkcli.sh -cmd linkconfig -zkhost 127.0.0.1:9983 -collection collection1 -confname conf1 -solrhome example/solr
}}}

== Known Limitations ==
A small number of Solr search components do not support distributed search. In some cases, a component may never get distributed support; in other cases it may just be a matter of time and effort. All of the search components that do not yet support standard distributed search have the same limitation with SolrCloud. You can pass distrib=false to use these components on a single SolrCore.

The Grouping feature only works if groups are in the same shard. Proper support will require custom hashing and there is already a JIRA issue working towards this.

== Glossary ==
||'''Collection''': ||A single search index. ||
||'''Shard''': ||Either a logical or physical section of a single index depending on context. A logical section is also called a slice. A physical shard is expressed as a SolrCore. ||
||'''Slice''': ||A logical section of a single index. One or more identical, physical shards make up a slice. ||
||'''SolrCore''': ||Encapsulates a single physical index. One or more make up logical shards (or slices) which make up a collection. ||
||'''Node''': ||A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections. ||
||'''Cluster''': ||All of the nodes you are using to host SolrCores. ||

== FAQ ==
 * '''Q:''' I'm seeing lots of session timeout exceptions - what to do?
  . '''A:''' Try raising the ZooKeeper session timeout by editing solr.xml - see the zkClientTimeout attribute. The minimum session timeout is 2 times your ZooKeeper-defined tickTime. The maximum is 20 times the tickTime. The default tickTime is 2 seconds.
 * '''Q:''' How do I use SolrCloud, but distribute updates myself?
  . '''A:''' Add the following UpdateProcessorFactory somewhere in your update chain: '''NoOpDistributingUpdateProcessorFactory'''
 * '''Q:''' What is the difference between a Collection and a SolrCore?
  . '''A:''' In classic single-node Solr, a SolrCore is basically equivalent to a Collection. It presents one logical index. In SolrCloud, the SolrCores on multiple nodes form a Collection. This is still just one logical index, but multiple SolrCores host different 'shards' of the full collection. So a SolrCore encapsulates a single physical index on an instance. A Collection is a combination of all of the SolrCores that together provide a logical index that is distributed across many nodes.
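As a sketch of the update chain mentioned in the FAQ above (the chain name here is hypothetical; the surrounding processors follow the usual default chain), a solrconfig.xml fragment that opts out of automatic update distribution might look like:

{{{
<updateRequestProcessorChain name="nodistrib">
  <!-- stands in for the auto-injected DistributedUpdateProcessorFactory,
       so updates are applied only to the core that receives them -->
  <processor class="solr.NoOpDistributingUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
}}}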