[Solr Wiki] Trivial Update of "CollectionDistribution" by HossMan

Apache Wiki Sun, 26 Feb 2006 13:17:10 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by HossMan:
http://wiki.apache.org/solr/CollectionDistribution

The comment on the change is:
Solr is not an acronym

------------------------------------------------------------------------------
- = Collection Distribution =
- 
  /!\ :TODO: /!\ update script / config paths based on final packaging strategy.
  
- SOLR distribution is similar in concept to database replication.  All 
collection changes come to one master SOLR server. All production queries are 
done against query slaves. Query slaves receive all their collection changes 
indirectly &#151; as new versions of a collection which they pull from the 
master.
+ Solr distribution is similar in concept to database replication.  All 
collection changes come to one master Solr server. All production queries are 
done against query slaves. Query slaves receive all their collection changes 
indirectly &#151; as new versions of a collection which they pull from the 
master.
  These collection downloads are polled for on a cron'd basis.
  
  A collection is a directory of many files.  Collections are distributed to 
the slaves as snapshots of these files.  Each snapshot is made up of hard links 
to the files so copying of the actual files is not necessary.  Lucene only 
''significantly'' rewrites files following an optimization command.  Generally, 
a file once written, will change very little if at all.  This makes the 
underlying transport of rsynch very useful.  Files that have already been 
transfered and have not changed do not need to be re-transferred with the new 
edition of a collection.
@@ -14, +12 @@

  == Terminology ==
  
  ||'''Term'''||'''Definition'''||
- ||Collection||A Lucene collection is a directory of files.  These comprise 
the indexed and returnable data of a SOLR search repository.||
+ ||Collection||A Lucene collection is a directory of files.  These comprise 
the indexed and returnable data of a Solr search repository.||
  ||Distribution||The copying of a collection from the master to all slaves. 
Distribution of a collection from the master to all slaves takes advantage of 
Lucene's index file structure. (This same feature enables fast incremental 
indexing in Lucene.)||
  ||Inserts and Deletes||As inserts and deletes occur in the collection the 
directory remains unchanged. Documents are always inserted into newly created 
files.  Documents that are deleted are not removed from the files. They are 
flagged in the file, '''deletable''', and are not removed from the files until 
the collection is '''optimized'''.||
- ||Master & Slave||The Solr distribution system uses the master/slave model.  
The master is the service which receives all updates initially and keeps 
everything organized. SOLR uses a single update master server coupled with 
multiple query slave servers. All changes (such as inserts, updates, deletes, 
etc.) are made against the single master server. Changes made on the master are 
distributed to all the slave servers which service all query requests from the 
clients.||
+ ||Master & Slave||The Solr distribution system uses the master/slave model.  
The master is the service which receives all updates initially and keeps 
everything organized. Solr uses a single update master server coupled with 
multiple query slave servers. All changes (such as inserts, updates, deletes, 
etc.) are made against the single master server. Changes made on the master are 
distributed to all the slave servers which service all query requests from the 
clients.||
- ||Update||An update is a single change request against a single SOLR 
instance.  It may be a request to delete a document, add a new document, change 
a document, delete all documents matching a query, etc.  Updates are handled 
synchronously within an individual SOLR instance.||
+ ||Update||An update is a single change request against a single Solr 
instance.  It may be a request to delete a document, add a new document, change 
a document, delete all documents matching a query, etc.  Updates are handled 
synchronously within an individual Solr instance.||
  ||Optimization||A Lucene collection must be optimized periodically to 
maintain query performance. Optimization is run on the master server only. An 
optimized index will give you a performance gain at query time of ''at least'' 
10%.  This gain may be more on an index that has become fragmented over a 
period of time with many updates and no optimizations. Optimizations require a 
'''much''' longer time than does the distribution of an optimized collection to 
all slaves.  During optimization, a collection is compacted and all segements 
are merged. New secondary segment(s) are created to contain documents inserted 
into the collection after it has been optimized.||
  ||Segments||The number of files in a collection.||
  ||mergeFactor||Controls the number of files (segments). For example, when 
mergeFactor is set to 2, existing documents remain in the main segment, while 
all updates are written to a single secondary segment.||
@@ -26, +24 @@

  
  == The Snapshot and Distribution Process ==
  
- These are the sequential steps that SOLR runs:
+ These are the sequential steps that Solr runs:
  
-    1. The snapshooter command takes snapshots of the collection on the 
master.  It runs when invoked by SOLR after it has done a commit or an optimize.
+    1. The snapshooter command takes snapshots of the collection on the 
master.  It runs when invoked by Solr after it has done a commit or an optimize.
     1. The snappuller command runs on the query slaves to pull the newest 
snapshot from the master. This is done via rsync in daemon mode running on the 
master so that it does not need to go through ssh compression, thereby saving a 
large amount of time and CPU cycles.
-    1. The snapinstaller runs on the slave after a snapshot has been pulled 
from the master. This signals the local SOLR server to open a new index reader, 
then  auto-warming of the cache(s) begins (in the new reader), while other 
requests continue to be served by the original index reader.  Once auto-warming 
is complete, SOLR retires the old reader and directs all new queries to the 
newly cache-warmed reader.
+    1. The snapinstaller runs on the slave after a snapshot has been pulled 
from the master. This signals the local Solr server to open a new index reader, 
then  auto-warming of the cache(s) begins (in the new reader), while other 
requests continue to be served by the original index reader.  Once auto-warming 
is complete, Solr retires the old reader and directs all new queries to the 
newly cache-warmed reader.
     1. All distribution activity is logged and written back to the master to 
be viewable on the distribution page of its GUI.
     1.  Old versions of the index are removed from the master and slave 
servers by a cron'd snapcleaner.
  
@@ -48, +46 @@

     * Taking a snapshot is very fast as well.  
  
  
- == SOLR Distribution Scripts ==
+ == Solr Distribution Scripts ==
  
     * The name of the index directory is defined in the configuration file, 
web.external.xml. [[BR]] /!\ :TODO: /!\ revise
-    * All Solr collection distribution scripts are installed by the RPM 
'''solr-tools''' and reside in the directory '''scripts/solr''' of each 
instance of SOLR. [[BR]] /!\ :TODO: /!\ revise
+    * All Solr collection distribution scripts are installed by the RPM 
'''solr-tools''' and reside in the directory '''scripts/solr''' of each 
instance of Solr. [[BR]] /!\ :TODO: /!\ revise
     * Collection distribution scripts create and prepare for distribution a 
snapshot of a search collection after each '''commit''' and '''optimize''' 
request.
     * The '''snapshooter''' script creates a directory 
''snapshot.&#60;ts&#62;'', where &#60;ts&#62; is a timestamp in the format, 
yyyymmddHHMMSS.  It contains hard links to the data files.
     * Snapshots are distributed from the master server when the slaves pull 
them, "smartcopying" the snapshot directory that contains the hard links to the 
most recent collection data files.  
-    * For usage arguments and syntax see ["SOLRCollectionDistributionScripts"]
+    * For usage arguments and syntax see ["SolrCollectionDistributionScripts"]
  
  ||'''Name'''||'''Description'''||
  ||snapshooter||Creates a snapshot of a collection. Snapshooter takes no 
arguments as it always applies to the most recent snapshot. Snapshooter runs 
only on the Master server when a commit happens. Snapshooter can also be run 
manually.||
  ||snappuller|| A shell script that runs as a cron job on a slave server. The 
script looks for new snapshots on the master server and pulls them. ||
  ||snappuller-enable||Creates the file, '''snappuller-enabled''', whose 
presence enables the snappuller.||
- ||snapinstaller||Snapinstaller installs the latest snapshot (determined by 
the timestamp) into the file, logs/snapshot.current, using hard links (similar 
to the process of taking a snapshot). Then snapshot.current is written and scp 
(secure copied) back to the master server. Snapinstaller then triggers SOLR to 
open a new Searcher.||
+ ||snapinstaller||Snapinstaller installs the latest snapshot (determined by 
the timestamp) into the file, logs/snapshot.current, using hard links (similar 
to the process of taking a snapshot). Then snapshot.current is written and scp 
(secure copied) back to the master server. Snapinstaller then triggers Solr to 
open a new Searcher.||
  ||snapcleaner||Runs as a cron job to remove snapshot directories more than 
seven days old. Also can be run manually.||
- ||rsyncd-start||Starts the rsyncd daemon on the master SOLR server which 
handles collection distribution requests from the slaves.||
+ ||rsyncd-start||Starts the rsyncd daemon on the master Solr server which 
handles collection distribution requests from the slaves.||
  ||rsyncd daemon|| Efficiently synchronizes a collection&#151;between master 
and slaves&#151;by copying only the files that actually changed. In addition, 
rsync can compress data before transmitting it. ||
- ||rsyncd-stop||Stops the rsyncd daemon on the master SOLR server. The stop 
script then makes sure that the daemon has in fact exited by trying to connect 
to it for up to 300 seconds. The stop script exits with ''error code 2'' if it 
fails to stop the rsyncd daemon.||
+ ||rsyncd-stop||Stops the rsyncd daemon on the master Solr server. The stop 
script then makes sure that the daemon has in fact exited by trying to connect 
to it for up to 300 seconds. The stop script exits with ''error code 2'' if it 
fails to stop the rsyncd daemon.||
  ||rsyncd-enable||Creates the file, rsyncd-enabled, whose presence allows the 
rsyncd daemon to run, allowing replication to occur.||
  ||rsyncd-disable||Removes the file, rsyncd-enabled, whose absence prevents 
the rsyncd daemon from running, preventing replication. ||
  
@@ -76, +74 @@

         -v          increase verbosity
  }}}
  
- ''snapshooter'' uses the configuration file, '''conf/distribution.conf''', to 
determine if the SOLR server is a master or slave. snapshooter is disabled on 
all slave SOLR servers. When invoked on a slave server, snapshooter displays an 
error message and exits with error code 1 without taking a snapshot.
+ ''snapshooter'' uses the configuration file, '''conf/distribution.conf''', to 
determine if the Solr server is a master or slave. snapshooter is disabled on 
all slave Solr servers. When invoked on a slave server, snapshooter displays an 
error message and exits with error code 1 without taking a snapshot.
  
  === rsyncd-enable ===
  
@@ -85, +83 @@

         -v          increase verbosity
  }}}
  
- ''rsyncd_enable'' enables the starting of the rsyncd daemon by creating the 
file rsyncd-enabled in the top level directory of the SOLR installation (for 
example, /var/opt/resin3/7000).  Please note that this script will not actually 
starts the rsyncd daemon.
+ ''rsyncd_enable'' enables the starting of the rsyncd daemon by creating the 
file rsyncd-enabled in the top level directory of the Solr installation (for 
example, /var/opt/resin3/7000).  Please note that this script will not actually 
starts the rsyncd daemon.
  
  === rsyncd-disable ===
  
@@ -94, +92 @@

         -v          increase verbosity
  }}}
  
- ''rsyncd-disable'' disables the starting of the rsyncd daemon by removing the 
file, rsyncd-enabled, from the top level directory of the SOLR installation 
(for example, /var/opt/resin3/7000).  Please note that this script will not 
actually stop the rsyncd daemon if it is already running.
+ ''rsyncd-disable'' disables the starting of the rsyncd daemon by removing the 
file, rsyncd-enabled, from the top level directory of the Solr installation 
(for example, /var/opt/resin3/7000).  Please note that this script will not 
actually stop the rsyncd daemon if it is already running.
  
  === rsyncd-start ===
  
@@ -102, +100 @@

  usage: rsyncd-start
  }}}
  
- Starts the rsyncd daemon on the master SOLR server. The rsyncd daemon sets 
its port number to be the port number of the master SOLR server incremented by 
10000. For example, if the master SOLR server runs at port 7000, then its  
rsyncd daemon runs at port 17000. The start script is synchronous. After 
starting the rsyncd daemon, it will attempt to connect to it for up to 15 
seconds. The start script will exit with error code 2 if it fails to connect to 
the rsyncd daemon. The configuration of the rsyncd daemon is controlled by the 
file, conf/rsyncd.conf. The process ID of the rsyncd daemon is written into the 
file, logs/rsyncd.pid. Output of the rsyncd daemon is written to the file, 
logs/rsyncd.log.
+ Starts the rsyncd daemon on the master Solr server. The rsyncd daemon sets 
its port number to be the port number of the master Solr server incremented by 
10000. For example, if the master Solr server runs at port 7000, then its  
rsyncd daemon runs at port 17000. The start script is synchronous. After 
starting the rsyncd daemon, it will attempt to connect to it for up to 15 
seconds. The start script will exit with error code 2 if it fails to connect to 
the rsyncd daemon. The configuration of the rsyncd daemon is controlled by the 
file, conf/rsyncd.conf. The process ID of the rsyncd daemon is written into the 
file, logs/rsyncd.pid. Output of the rsyncd daemon is written to the file, 
logs/rsyncd.log.
  
  === rsyncd-stop ===
  
@@ -110, +108 @@

  usage: rsyncd-stop
  }}}
  
- Stops the rsyncd daemon on the master SOLR server. The stop script is 
synchronous. After stopping the rsyncd daemon, it makes sure that the daemon 
has exited by trying to connect to it for up to 300 seconds. The stop script 
will exit with error code 2 if it fails to stop the rsyncd daemon.
+ Stops the rsyncd daemon on the master Solr server. The stop script is 
synchronous. After stopping the rsyncd daemon, it makes sure that the daemon 
has exited by trying to connect to it for up to 300 seconds. The stop script 
will exit with error code 2 if it fails to stop the rsyncd daemon.
  
  === snappuller-enable ===
  
@@ -119, +117 @@

         -v          increase verbosity
  }}}
  
- ''snappuller_enable'' enables the snappuller by creating the file, 
snappuller-enabled, in the top level directory of the SOLR installation (for 
example, /var/opt/resin3/7000).
+ ''snappuller_enable'' enables the snappuller by creating the file, 
snappuller-enabled, in the top level directory of the Solr installation (for 
example, /var/opt/resin3/7000).
  
  === snappuller-disable ===
  
@@ -128, +126 @@

         -v          increase verbosity
  }}}
  
- ''snappuller-disable'' disables the snappuller by removing the file, 
snappuller-enabled, from the top level directory of the SOLR installation (for 
example, /var/opt/resin3/7000).
+ ''snappuller-disable'' disables the snappuller by removing the file, 
snappuller-enabled, from the top level directory of the Solr installation (for 
example, /var/opt/resin3/7000).
  
  === snappuller ===
  
@@ -142, +140 @@

         -z          enable compression of data
  }}}
  
- ''snappuller'' gets the hostname and port number of the master SOLR server 
from the file conf/distribution.conf. The values in conf/distribution.conf are 
overwritten by the command line options, -m and -p ,if they are present.
+ ''snappuller'' gets the hostname and port number of the master Solr server 
from the file conf/distribution.conf. The values in conf/distribution.conf are 
overwritten by the command line options, -m and -p ,if they are present.
  
- If snappuller has been disabled, it will log an appropriate message in its 
log file, and then exit without pulling any snapshot from the master SOLR 
server.
+ If snappuller has been disabled, it will log an appropriate message in its 
log file, and then exit without pulling any snapshot from the master Solr 
server.
  
- If the name of the snapshot to be pull is not specified by the use of the 
"-n" option, snappuller will use ssh to determine the name of the most recent 
snapshot available on the master SOLR server and pull it over if it does not 
already exist on the slave SOLR server.
+ If the name of the snapshot to be pull is not specified by the use of the 
"-n" option, snappuller will use ssh to determine the name of the most recent 
snapshot available on the master Solr server and pull it over if it does not 
already exist on the slave Solr server.
  
- The status and stats of the current or most recent rsync operation of 
snappuller is kept in the file, logs/snappuller.status. Whenever this file is 
updated by snappuller, a copy is scp back to the master SOLR server. See 
Distribution Status and Stats for more details.
+ The status and stats of the current or most recent rsync operation of 
snappuller is kept in the file, logs/snappuller.status. Whenever this file is 
updated by snappuller, a copy is scp back to the master Solr server. See 
Distribution Status and Stats for more details.
  
  === snapinstaller ===
  
@@ -159, +157 @@

         -v          increase verbosity
  }}}
  
- ''snapinstaller'' gets the hostname and port number of the master SOLR server 
from the file conf/distribution.conf. The values in conf/distribution.conf are 
overwritten by the command line options, -m and -p, if they are present.
+ ''snapinstaller'' gets the hostname and port number of the master Solr server 
from the file conf/distribution.conf. The values in conf/distribution.conf are 
overwritten by the command line options, -m and -p, if they are present.
  
- After a snapshot has been installed, snapinstaller writes its name into the 
file, logs/snapshot.current, and scp a copy of this file back to the master 
SOLR server. See Distribution Status and Stats for more details.
+ After a snapshot has been installed, snapinstaller writes its name into the 
file, logs/snapshot.current, and scp a copy of this file back to the master 
Solr server. See Distribution Status and Stats for more details.
  
  === snapcleaner ===
  
@@ -173, +171 @@

         -v           increase verbosity
  }}}
  
- == SOLR Distribution related Cron Jobs ==
+ == Solr Distribution related Cron Jobs ==
  
  The distribution process is automated via the use of cron jobs.
- The cron jobs should run under the user id that the SOLR application is
+ The cron jobs should run under the user id that the Solr application is
  running under.
  
  === snapcleaner ===
  
  ''snapcleaner'' should be run out of cron at the regular basis to clean up
- old snapshots.  This should be done on both the SOLR master and slave 
servers.  
+ old snapshots.  This should be done on both the Solr master and slave 
servers.  
  For example, the following cron job runs everyday at midnight and cleans up 
snapshots 8 days and older:
  
  {{{
@@ -193, +191 @@

  
  === snappuller and snapinstaller ===
  
- On the SOLR slave servers, ''snappuller'' should be run out of cron 
regularily to get the latest index from the master server.  It is a good idea 
to also run ''snapinstaller'' with ''snappuller'' back-to-back in the same 
crontab entry to install the latest index once it has been copied over to the 
slave.  For example, the following cron job runs every 5 minutes to keep the 
slave server in sync with the master:
+ On the Solr slave servers, ''snappuller'' should be run out of cron 
regularily to get the latest index from the master server.  It is a good idea 
to also run ''snapinstaller'' with ''snappuller'' back-to-back in the same 
crontab entry to install the latest index once it has been copied over to the 
slave.  For example, the following cron job runs every 5 minutes to keep the 
slave server in sync with the master:
  
  {{{
  0,5,10,15,20,25,30,35,40,45,50,55 * * * * 
/var/opt/resin3/5051/scripts/solr/snappuller;/var/opt/resin3/5051/scripts/solr/snapinstaller

[Solr Wiki] Trivial Update of "CollectionDistribution" by HossMan

Reply via email to