Hello, everyone,

My company will be using Solr on the server appliance we deliver to our clients. We would like to maintain remote backups of clients' search indexes to avoid rebuilding a large index when an appliance fails.

One of our clients backs up their data to a remote server from a vendor that provides only storage space, so I don't believe we can set up a remote slave server there to use Solr's replication functionality. Because the client has a low-bandwidth connection to that backup server, we would like to minimize the amount of data transferred to the remote machine. Our Solr index receives commits every few minutes and will probably be optimized roughly once a day. Given that, is it possible to transfer only an amount of data roughly proportional to the number of documents added to the index each day? From my understanding, optimizing an index rewrites most of its files, which would seem to force a near-complete re-transfer after each optimize. Is there a way around this that I may be missing?
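
To make the idea concrete, here is roughly the kind of incremental copy I am imagining. The paths and the size-based comparison are just placeholders, and in practice we would probably copy from a snapshot taken after a commit rather than the live index directory:

    import os
    import shutil

    # Rough sketch only. Because Lucene segment files are write-once, a file
    # that is already present in the backup (same name and size here, for
    # simplicity) should not need to be transferred again; only segments
    # created since the last backup get copied.
    INDEX_DIR = "/var/solr/data/index"        # placeholder local index path
    BACKUP_DIR = "/mnt/remote-backup/index"   # placeholder mounted remote storage

    def incremental_copy(index_dir, backup_dir):
        os.makedirs(backup_dir, exist_ok=True)
        existing = {
            name: os.path.getsize(os.path.join(backup_dir, name))
            for name in os.listdir(backup_dir)
        }
        for name in os.listdir(index_dir):
            src = os.path.join(index_dir, name)
            if not os.path.isfile(src):
                continue
            # Skip files already backed up unchanged; copy everything else.
            if existing.get(name) == os.path.getsize(src):
                continue
            shutil.copy2(src, os.path.join(backup_dir, name))
        # Files removed locally (e.g. segments merged away by an optimize)
        # would also need to be pruned from the backup; not shown here.

    incremental_copy(INDEX_DIR, BACKUP_DIR)

Between optimizes this would only move the new segment files, but right after the daily optimize nearly every file is new again, which is the cost I am hoping to avoid.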

We have faced this problem in the past when our product used a Lucene-based search engine. We were unable to find a solution where we could copy only the "diffs" introduced to the index since the most recent backup, so we opted to make our indexing process faster instead. In addition to plain text, many of the documents we are indexing are binary formats such as Word and PDF. We cached the extracted text from these binary documents on the clients' backup servers, saving us the cost of extraction at index time. If we must pursue a solution like this for Solr, how else might we speed up the indexing process?
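
For reference, our old cache worked roughly along these lines; extract_text below is only a stand-in for whatever parser we end up using (e.g. Tika), and the cache path is made up:

    import hashlib
    import os

    CACHE_DIR = "/var/cache/extracted-text"   # placeholder cache location

    def extract_text(path):
        # Stand-in for the real extractor (e.g. a Tika call).
        raise NotImplementedError

    def cached_text(path):
        # Return extracted text, re-parsing the binary file only when its
        # content has changed since the last extraction.
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        cache_path = os.path.join(CACHE_DIR, digest + ".txt")
        if os.path.exists(cache_path):
            with open(cache_path, encoding="utf-8") as f:
                return f.read()
        text = extract_text(path)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cache_path, "w", encoding="utf-8") as f:
            f.write(text)
        return text

Keying the cache on a content hash meant we only paid for extraction when a document actually changed, which made full re-indexing much cheaper.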

Much appreciated,
Peter Kritikos
