You can use rsync to transfer only the files that have changed. You shouldn't have to home-grow your own 'only transfer the diffs' solution; rsync will do that for you.

But yes, running an optimize after many updates/deletes will generally mean nearly everything has changed, since the segment files get rewritten.

Solr's index, of course, _is_ a Lucene index, so your experience with Lucene will carry over to Solr. Lucene or Solr may have added new features since you last used it, but you're still using Lucene when you're using Solr.

On 8/9/2011 11:22 AM, Peter Kritikos wrote:
Hello, everyone,

My company will be using Solr on the server appliance we deliver to our clients. We would like to maintain remote backups of clients' search indexes to avoid rebuilding a large index when an appliance fails.

One of our clients backs up their data onto a remote server provided by a vendor which only provides storage space, so I don't believe it is possible for us to set up a remote slave server to use Solr's replication functionality. Because our client has a low-bandwidth connection to their backup server, we would like to minimize the amount of data transferred to the remote machine. Our Solr index receives commits every few minutes and will probably be optimized roughly once a day. Does our frequently modified index allow us to transfer an amount of data proportional to the number of new documents added to the search index daily? From my understanding, optimizing an index makes very significant changes to its files. Is there a way around this that I may be missing?

We have faced this problem in the past when our product used a Lucene-based search engine. We were unable to find a solution where we could only copy the "diffs" introduced to the index since the most recent backup, so we opted to make our indexing process faster. In addition to plain text, many of the documents that we are indexing are binary, e.g. Word, PDF. We cached the extracted text from these binary documents on the clients' backup servers, saving us the cost of extraction at index time. If we must pursue a solution like this for Solr, how else might we optimize the indexing process?
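[A cached-extraction step like the one described could be sketched roughly as below; the checksum-keyed cache directory and the stand-in `extract_text` function are hypothetical, with a real extractor such as Tika substituted in practice:]

```shell
#!/bin/sh
# Sketch: cache extracted text keyed by the source file's checksum, so a
# re-index only pays extraction cost for new or changed documents.
# extract_text is a stand-in for a real binary-document extractor.
set -e
CACHE=$(mktemp -d)
extract_text() { tr 'a-z' 'A-Z' < "$1"; }    # placeholder extractor

doc=$(mktemp)
printf 'hello pdf' > "$doc"                  # stand-in binary document
key=$(cksum "$doc" | cut -d' ' -f1)          # cache key: file checksum

if [ ! -f "$CACHE/$key" ]; then
    extract_text "$doc" > "$CACHE/$key"      # extract once, cache result
fi
cat "$CACHE/$key"                            # subsequent runs hit the cache
```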

Much appreciated,
Peter Kritikos
