[Solr Wiki] Update of "CollectionDistribution" by HossMan

Apache Wiki Mon, 13 Feb 2006 23:24:22 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by HossMan:
http://wiki.apache.org/solr/CollectionDistribution

The comment on the change is:
initial import of PI/CollectionDistribution from CNET's wiki

New page:
= Collection Distribution =

/!\ :TODO: /!\ update script / config paths based on final packaging strategy.

SOLR distribution is similar in concept to database replication.  All 
collection changes come to one master SOLR server. All production queries are 
done against query slaves. Query slaves receive all their collection changes 
indirectly &#151; as new versions of a collection which they pull from the 
master.
These collection downloads are polled for on a cron'd basis.

A collection is a directory of many files.  Collections are distributed to the 
slaves as snapshots of these files.  Each snapshot is made up of hard links to 
the files so copying of the actual files is not necessary.  Lucene only 
''significantly'' rewrites files following an optimization command.  Generally, 
a file once written, will change very little if at all.  This makes the 
underlying transport of rsynch very useful.  Files that have already been 
transfered and have not changed do not need to be re-transferred with the new 
edition of a collection.

[[TableOfContents]]

== Terminology ==

||'''Term'''||'''Definition'''||
||Collection||A Lucene collection is a directory of files.  These comprise the 
indexed and returnable data of a SOLR search repository.||
||Distribution||The copying of a collection from the master to all slaves. 
Distribution of a collection from the master to all slaves takes advantage of 
Lucene's index file structure. (This same feature enables fast incremental 
indexing in Lucene.)||
||Inserts and Deletes||As inserts and deletes occur in the collection the 
directory remains unchanged. Documents are always inserted into newly created 
files.  Documents that are deleted are not removed from the files. They are 
flagged in the file, '''deletable''', and are not removed from the files until 
the collection is '''optimized'''.||
||Master & Slave||The Solr distribution system uses the master/slave model.  
The master is the service which receives all updates initially and keeps 
everything organized. SOLR uses a single update master server coupled with 
multiple query slave servers. All changes (such as inserts, updates, deletes, 
etc.) are made against the single master server. Changes made on the master are 
distributed to all the slave servers which service all query requests from the 
clients.||
||Update||An update is a single change request against a single SOLR instance.  
It may be a request to delete a document, add a new document, change a 
document, delete all documents matching a query, etc.  Updates are handled 
synchronously within an individual SOLR instance.||
||Optimization||A Lucene collection must be optimized periodically to maintain 
query performance. Optimization is run on the master server only. An optimized 
index will give you a performance gain at query time of ''at least'' 10%.  This 
gain may be more on an index that has become fragmented over a period of time 
with many updates and no optimizations. Optimizations require a '''much''' 
longer time than does the distribution of an optimized collection to all 
slaves.  During optimization, a collection is compacted and all segements are 
merged. New secondary segment(s) are created to contain documents inserted into 
the collection after it has been optimized.||
||Segments||The number of files in a collection.||
||mergeFactor||Controls the number of files (segments). For example, when 
mergeFactor is set to 2, existing documents remain in the main segment, while 
all updates are written to a single secondary segment.||
||Snapshot||The snapshot directory contains hard links to the data files. 
Snapshots are distributed from the master server when the slaves pull them, 
"smartcopying" the snapshot directory that contains the hard links to the most 
recent collection data files.||

== The Snapshot and Distribution Process ==

These are the sequential steps that SOLR runs:

   1. The snapshooter command takes snapshots of the collection on the master.  
It runs when invoked by SOLR after it has done a commit or an optimize.
   1. The snappuller command runs on the query slaves to pull the newest 
snapshot from the master. This is done via rsync in daemon mode running on the 
master so that it does not need to go through ssh compression, thereby saving a 
large amount of time and CPU cycles.
   1. The snapinstaller runs on the slave after a snapshot has been pulled from 
the master. This signals the local SOLR server to open a new index reader, then 
 auto-warming of the cache(s) begins (in the new reader), while other requests 
continue to be served by the original index reader.  Once auto-warming is 
complete, SOLR retires the old reader and directs all new queries to the newly 
cache-warmed reader.
   1. All distribution activity is logged and written back to the master to be 
viewable on the distribution page of its GUI.
   1.  Old versions of the index are removed from the master and slave servers 
by a cron'd snapcleaner.

If you are building an index from scratch, distribution is the final step of 
the process. For detailed steps, see CollectionBuilding 

Manual copying of index files is not recommended; however, running distribution 
commands manually (i.e., not relying on crond to run them) is perfectly fine.

== Snapshot Directories ==

Snapshots directories are in the format, snaphost.''yyyymmddHHMMSS''.

All the files in the index directory are hard links to the latest snapshot. 
This technique has these advantages:
   * Can keep multiple snapshots on each host without the need to keep multiple 
copies of index files that have not changed.  
   * File copying from master to slave is very fast.  
   * Taking a snapshot is very fast as well.  


== SOLR Distribution Scripts ==

   * The name of the index directory is defined in the configuration file, 
web.external.xml. [[BR]] /!\ :TODO: /!\ revise
   * All Solr collection distribution scripts are installed by the RPM 
'''solr-tools''' and reside in the directory '''scripts/solr''' of each 
instance of SOLR. [[BR]] /!\ :TODO: /!\ revise
   * Collection distribution scripts create and prepare for distribution a 
snapshot of a search collection after each '''commit''' and '''optimize''' 
request.
   * The '''snapshooter''' script creates a directory 
''snapshot.&#60;ts&#62;'', where &#60;ts&#62; is a timestamp in the format, 
yyyymmddHHMMSS.  It contains hard links to the data files.
   * Snapshots are distributed from the master server when the slaves pull 
them, "smartcopying" the snapshot directory that contains the hard links to the 
most recent collection data files.  
   * For usage arguments and syntax see ["SOLRCollectionDistributionScripts"]

||'''Name'''||'''Description'''||
||snapshooter||Creates a snapshot of a collection. Snapshooter takes no 
arguments as it always applies to the most recent snapshot. Snapshooter runs 
only on the Master server when a commit happens. Snapshooter can also be run 
manually.||
||snappuller|| A shell script that runs as a cron job on a slave server. The 
script looks for new snapshots on the master server and pulls them. ||
||snappuller-enable||Creates the file, '''snappuller-enabled''', whose presence 
enables the snappuller.||
||snapinstaller||Snapinstaller installs the latest snapshot (determined by the 
timestamp) into the file, logs/snapshot.current, using hard links (similar to 
the process of taking a snapshot). Then snapshot.current is written and scp 
(secure copied) back to the master server. Snapinstaller then triggers SOLR to 
open a new Searcher.||
||snapcleaner||Runs as a cron job to remove snapshot directories more than 
seven days old. Also can be run manually.||
||rsyncd-start||Starts the rsyncd daemon on the master SOLR server which 
handles collection distribution requests from the slaves.||
||rsyncd daemon|| Efficiently synchronizes a collection&#151;between master and 
slaves&#151;by copying only the files that actually changed. In addition, rsync 
can compress data before transmitting it. ||
||rsyncd-stop||Stops the rsyncd daemon on the master SOLR server. The stop 
script then makes sure that the daemon has in fact exited by trying to connect 
to it for up to 300 seconds. The stop script exits with ''error code 2'' if it 
fails to stop the rsyncd daemon.||
||rsyncd-enable||Creates the file, rsyncd-enabled, whose presence allows the 
rsyncd daemon to run, allowing replication to occur.||
||rsyncd-disable||Removes the file, rsyncd-enabled, whose absence prevents the 
rsyncd daemon from running, preventing replication. ||

=== snapshooter ===

   {{{
usage: snapshooter [ -v ]
       -v          increase verbosity
}}}

''snapshooter'' uses the configuration file, '''conf/distribution.conf''', to 
determine if the SOLR server is a master or slave. snapshooter is disabled on 
all slave SOLR servers. When invoked on a slave server, snapshooter displays an 
error message and exits with error code 1 without taking a snapshot.

=== rsyncd-enable ===

   {{{
usage: rsyncd-enable [ -v ]
       -v          increase verbosity
}}}

''rsyncd_enable'' enables the starting of the rsyncd daemon by creating the 
file rsyncd-enabled in the top level directory of the SOLR installation (for 
example, /var/opt/resin3/7000).  Please note that this script will not actually 
starts the rsyncd daemon.

=== rsyncd-disable ===

   {{{
usage: rsyncd-disable [ -v ]
       -v          increase verbosity
}}}

''rsyncd-disable'' disables the starting of the rsyncd daemon by removing the 
file, rsyncd-enabled, from the top level directory of the SOLR installation 
(for example, /var/opt/resin3/7000).  Please note that this script will not 
actually stop the rsyncd daemon if it is already running.

=== rsyncd-start ===

   {{{
usage: rsyncd-start
}}}

Starts the rsyncd daemon on the master SOLR server. The rsyncd daemon sets its 
port number to be the port number of the master SOLR server incremented by 
10000. For example, if the master SOLR server runs at port 7000, then its  
rsyncd daemon runs at port 17000. The start script is synchronous. After 
starting the rsyncd daemon, it will attempt to connect to it for up to 15 
seconds. The start script will exit with error code 2 if it fails to connect to 
the rsyncd daemon. The configuration of the rsyncd daemon is controlled by the 
file, conf/rsyncd.conf. The process ID of the rsyncd daemon is written into the 
file, logs/rsyncd.pid. Output of the rsyncd daemon is written to the file, 
logs/rsyncd.log.

=== rsyncd-stop ===

   {{{
usage: rsyncd-stop
}}}

Stops the rsyncd daemon on the master SOLR server. The stop script is 
synchronous. After stopping the rsyncd daemon, it makes sure that the daemon 
has exited by trying to connect to it for up to 300 seconds. The stop script 
will exit with error code 2 if it fails to stop the rsyncd daemon.

=== snappuller-enable ===

   {{{
usage: snappuller-enable [ -v ]
       -v          increase verbosity
}}}

''snappuller_enable'' enables the snappuller by creating the file, 
snappuller-enabled, in the top level directory of the SOLR installation (for 
example, /var/opt/resin3/7000).

=== snappuller-disable ===

   {{{
usage: snappuller-disable [ -v ]
       -v          increase verbosity
}}}

''snappuller-disable'' disables the snappuller by removing the file, 
snappuller-enabled, from the top level directory of the SOLR installation (for 
example, /var/opt/resin3/7000).

=== snappuller ===

   {{{
usage: snappuller -m master -p port [-n snapshot] [ -svz ]
       -m master   hostname of master server from where to pull index snapshot
       -p port     port number of master server from where to pull index 
snapshot
       -n snapshot pull a specific snapshot by name
       -s          use the --size-only option with rsync
       -v          increase verbosity (-vv show file transfer stats also)
       -z          enable compression of data
}}}

''snappuller'' gets the hostname and port number of the master SOLR server from 
the file conf/distribution.conf. The values in conf/distribution.conf are 
overwritten by the command line options, -m and -p ,if they are present.

If snappuller has been disabled, it will log an appropriate message in its log 
file, and then exit without pulling any snapshot from the master SOLR server.

If the name of the snapshot to be pull is not specified by the use of the "-n" 
option, snappuller will use ssh to determine the name of the most recent 
snapshot available on the master SOLR server and pull it over if it does not 
already exist on the slave SOLR server.

The status and stats of the current or most recent rsync operation of 
snappuller is kept in the file, logs/snappuller.status. Whenever this file is 
updated by snappuller, a copy is scp back to the master SOLR server. See 
Distribution Status and Stats for more details.

=== snapinstaller ===

   {{{
usage: snapinstaller -m master -p port [ -v ]
       -m master   hostname of master server where snapshot stats are posted
       -p port     port number of master server from where to pull index 
snapshot
       -v          increase verbosity
}}}

''snapinstaller'' gets the hostname and port number of the master SOLR server 
from the file conf/distribution.conf. The values in conf/distribution.conf are 
overwritten by the command line options, -m and -p, if they are present.

After a snapshot has been installed, snapinstaller writes its name into the 
file, logs/snapshot.current, and scp a copy of this file back to the master 
SOLR server. See Distribution Status and Stats for more details.

=== snapcleaner ===

   {{{
usage: snapcleaner -d <days> | -n <num> [ -v ]
       -d <days>    cleanup snapshots more than <days> days old
       -n <num>     keep the most most recent <num> number of snapshots and
                    cleanup up the remaining ones that are not being pulled
       -v           increase verbosity
}}}

== SOLR Distribution related Cron Jobs ==

The distribution process is automated via the use of cron jobs.
The cron jobs should run under the user id that the SOLR application is
running under.

=== snapcleaner ===

''snapcleaner'' should be run out of cron at the regular basis to clean up
old snapshots.  This should be done on both the SOLR master and slave servers.  
For example, the following cron job runs everyday at midnight and cleans up 
snapshots 8 days and older:

{{{
0 0 * * * /var/opt/resin3/5051/scripts/solr/snapcleaner -d 7
}}}

Additional cleanup can always be performed on-demand by running ''snapcleaner'' 
manually.

=== snappuller and snapinstaller ===

On the SOLR slave servers, ''snappuller'' should be run out of cron regularily 
to get the latest index from the master server.  It is a good idea to also run 
''snapinstaller'' with ''snappuller'' back-to-back in the same crontab entry to 
install the latest index once it has been copied over to the slave.  For 
example, the following cron job runs every 5 minutes to keep the slave server 
in sync with the master:

{{{
0,5,10,15,20,25,30,35,40,45,50,55 * * * * 
/var/opt/resin3/5051/scripts/solr/snappuller;/var/opt/resin3/5051/scripts/solr/snapinstaller
}}}

== Commit and Optimization ==

On a very large index, adding even a few documents then running an optimize 
means rewriting the complete index.  This consumes a lot of disk I/O and 
impacts query performace. Optimizing a very large index may even involve 
copying the index twice  &#151; the current code for merging one index into 
another calls optimize at the beginning ''and'' the end.  If some docs have 
been deleted, the first optimize call will rewrite the index even before the 
second index is merged.

Optimizations can take nearly ten minutes to run.  We do not know what happens 
to query performance on a collection that has not been optimized for a long 
time. We ''do'' know that it will get worse as the collection becomes more 
fragmented, but   how much worse is very dependent on the manner of updates and 
commits to the collection.

We are presuming optimizations should be run once following large 
''batch-like'' updates to the collection and/or once a day.

== Distribution and Optimization ==

An optimization on the master takes several minutes.  Distribution of a newly 
optimized collection takes only a little over a minute.  During optimization 
the machine is under load and does not process queries very well.  Given a 
schedule of updates being driven a few times an hour to the slaves, we cannot 
run an optimize with every committed snapshot.  We do recommend that an 
optimize be run  on the master at least once a day. 

Copying an optimized collection means that the '''entire''' collection will 
need to be transferred during the next snappull. This is a large expense, but 
not nearly as huge as running the optimize everywhere.  In a three-slave 
one-master configuration, distributing a newly-optimized collection takes 
approximately 80 seconds ''total''.  Rolling the change across a tier would 
require approximately ten minutes per machine (or machine group).  If this 
optimize were rolled across the query tier, and if each collection being 
optimized were disabled and not receiving queries, a rollout would take at 
least twenty minutes and potentially an hour and a half.  Additionally, the 
files would need to be synchronized so that the ''following'' rsynch snappull 
would not think that the independently optimized files were different in any 
way.  This would also leave the door open to independent corruption of 
collections instead of each being a perfect copy of the master.

Optimizing on the master allows for a straight-forward optimization operation.  
No query slaves need to be taken out of service.  The optimized collection can 
be distributed in the background as queries are being normally serviced.  The 
optimization can occur at any time convenient to the application providing 
collection updates.

[Solr Wiki] Update of "CollectionDistribution" by HossMan

Reply via email to