Hello, everyone,
My company will be using Solr on the server appliance we deliver to our
clients. We would like to maintain remote backups of clients' search
indexes to avoid rebuilding a large index when an appliance fails.
One of our clients backs up their data onto a remote server provided by
a vendor that provides only storage space, so I don't believe we can
run a remote slave server there and use Solr's replication
functionality. Because our client has a low-bandwidth
connection to their backup server, we would like to minimize the amount
of data transferred to the remote machine. Our Solr index receives
commits every few minutes and will probably be optimized roughly once a
day. Does our frequently modified index allow us to transfer an amount
of data proportional to the number of new documents added to the search
index daily? From my understanding, optimizing an index makes very
significant changes to its files. Is there a way around this that I may
be missing?
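For what it's worth, here is the kind of incremental copy we have been
considering. It relies on the fact that Lucene segment files are
write-once: between optimizes, a commit mostly adds new files, so
copying only files absent from the backup approximates a "diff"; after
an optimize, nearly every file is new and the copy degenerates to a
full transfer, which is exactly the problem we are asking about. This
is only a sketch with hypothetical paths, not something we have in
production:

```python
import os
import shutil

def incremental_backup(index_dir, backup_dir):
    """Copy only index files not yet present in the backup.

    Lucene segment files are write-once, so between optimizes each
    commit mostly adds new files; after an optimize almost everything
    is new and this degenerates into a full copy.
    """
    os.makedirs(backup_dir, exist_ok=True)
    existing = set(os.listdir(backup_dir))
    copied = []
    for name in sorted(os.listdir(index_dir)):
        src = os.path.join(index_dir, name)
        if name not in existing and os.path.isfile(src):
            shutil.copy2(src, os.path.join(backup_dir, name))
            copied.append(name)
    return copied
```

A real version would also have to prune backup files that merges have
made obsolete, or the backup grows without bound; I've left that out
for brevity.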
We have faced this problem in the past when our product used a
Lucene-based search engine. We were unable to find a solution where we
could only copy the "diffs" introduced to the index since the most
recent backup, so we opted to make our indexing process faster. In
addition to plain text, many of the documents that we index are in
binary formats such as Word and PDF. We cached the extracted text from
these binary
documents on the clients' backup servers, saving us the cost of
extraction at index time. If we must pursue a solution like this for
Solr, how else might we optimize the indexing process?
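To make the caching idea concrete, this is roughly what our old
extraction cache did: key the cache on a content hash of the document
so re-indexing only pays for extraction on a cache miss. The
`extractor` callable stands in for whatever converts Word/PDF bytes to
text (for us, a call out to an extraction library); the function names
and layout here are illustrative, not our actual code:

```python
import hashlib
import os

def extract_text_cached(doc_path, cache_dir, extractor):
    """Return extracted text for a binary document, caching by
    content hash so rebuilding the index skips re-extraction.

    `extractor` takes the raw document bytes and returns text; it is
    invoked only on a cache miss.
    """
    with open(doc_path, "rb") as f:
        data = f.read()
    key = hashlib.sha256(data).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".txt")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    text = extractor(data)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Hashing the content rather than the filename means a renamed but
unchanged document still hits the cache, while any edit to the
document forces a fresh extraction.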
Much appreciated,
Peter Kritikos