You can use rsync to automatically transfer only the files that have
changed. You shouldn't have to home-grow your own "only transfer the
diffs" solution; rsync will do that for you.
But yes, running an optimization, after many updates/deletes, will
generally mean nearly everything has changed.
Solr's index, of course, _is_ a Lucene index, so your experience with
Lucene will be applicable to Solr. Lucene and Solr may have added new
features since you last used them, but you're still using Lucene when
you're using Solr.
On 8/9/2011 11:22 AM, Peter Kritikos wrote:
Hello, everyone,
My company will be using Solr on the server appliance we deliver to
our clients. We would like to maintain remote backups of clients'
search indexes to avoid rebuilding a large index when an appliance fails.
One of our clients backs up their data onto a remote server provided
by a vendor which only provides storage space, so I don't believe it
is possible for us to set up a remote slave server to use Solr's
replication functionality. Because our client has a low-bandwidth
connection to their backup server, we would like to minimize the
amount of data transferred to the remote machine. Our Solr index
receives commits every few minutes and will probably be optimized
roughly once a day. Does our frequently modified index allow us to
transfer an amount of data proportional to the number of new documents
added to the search index daily? From my understanding, optimizing an
index makes very significant changes to its files. Is there a way
around this that I may be missing?
We have faced this problem in the past when our product used a
Lucene-based search engine. We were unable to find a solution where we
could only copy the "diffs" introduced to the index since the most
recent backup, so we opted to make our indexing process faster. In
addition to plain text, many of the documents that we are indexing are
binary, e.g. Word, PDF. We cached the extracted text from these binary
documents on the clients' backup servers, saving us the cost of
extraction at index time. If we must pursue a solution like this for
Solr, how else might we optimize the indexing process?
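The extracted-text caching described above could be sketched roughly like this (Python for brevity, though your stack may be Java; the cache directory and the extract_text stand-in are hypothetical — in practice extraction would go through something like Apache Tika or Solr Cell):

```python
import hashlib
import pathlib

# Hypothetical cache location; in the scenario above this would live on
# the client's backup server.
CACHE_DIR = pathlib.Path("/tmp/extract_cache")
CACHE_DIR.mkdir(exist_ok=True)

def extract_text(raw: bytes) -> str:
    # Stand-in for a real binary-document extractor (Word, PDF, etc.).
    return raw.decode("utf-8", errors="replace")

def cached_extract(path: str) -> str:
    """Return extracted text, keyed by a hash of the file's bytes,
    so unchanged documents never pay the extraction cost twice."""
    raw = pathlib.Path(path).read_bytes()
    key = hashlib.sha256(raw).hexdigest()
    cache_file = CACHE_DIR / key
    if cache_file.exists():
        # Cache hit: reuse previously extracted text.
        return cache_file.read_text()
    # Cache miss: extract once and store for future index rebuilds.
    text = extract_text(raw)
    cache_file.write_text(text)
    return text
```

Keying the cache on a content hash (rather than filename) means a rebuilt index after an appliance failure only re-extracts documents whose bytes have actually changed.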
Much appreciated,
Peter Kritikos