On 10/13/21 6:51 AM, Michael Conrad wrote:
<rant>
Not everyone has the luxury of being able to reindex from scratch with
data they might not have copies of anymore, say, for copyright
reasons, or because of space constraints that can't be alleviated, or
because the size of the collection makes it unrealistic in time...
</rant>
Completely understandable, but also problematic.
With any Lucene-based software, including Solr, reindexing is REQUIRED
after many config changes, and it is highly recommended on ANY upgrade,
even to a new minor version in the same major release. Because of this,
it is strongly recommended that the source data is always accessible for
building the Solr index from scratch.
I once wrote a page on the Solr wiki about reindexing. Some of that
information, plus more that I didn't get written down, has been
incorporated into the Solr Reference Guide:
https://solr.apache.org/guide/8_9/reindexing.html
One thing from my wiki page that did NOT make it into the reference
guide is the idea of using a separate Solr install as an
intermediary that only stores the data without making it searchable,
and then using that install as the source for reindexing. This paradigm
is being used successfully in the wild.
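
For illustration, here is a rough Python sketch of how a reindex from such a
storage-only install might look. It is not taken from the wiki page or the
reference guide. The two URLs, the core names "store" and "search", and the
uniqueKey field name "id" are all assumptions, and it only works if every
field you need is retrievable (stored) in the intermediary. It pages through
the store with cursorMark and posts batches to the target's /update handler.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical locations; adjust to your own installs.
STORE_URL = "http://localhost:8983/solr/store"    # storage-only Solr
TARGET_URL = "http://localhost:8984/solr/search"  # index being rebuilt


def http_fetch(params):
    """Run a /select query against the store and return the parsed JSON."""
    url = STORE_URL + "/select?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def http_send(docs):
    """POST a batch of documents to the target's /update handler."""
    req = urllib.request.Request(
        TARGET_URL + "/update",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()


def iter_stored_docs(fetch, rows=1000, unique_key="id"):
    """Yield every document from the store using cursorMark deep paging."""
    cursor = "*"
    while True:
        result = fetch({
            "q": "*:*",
            "rows": rows,
            "sort": unique_key + " asc",  # cursors need a sort on the uniqueKey
            "cursorMark": cursor,
            "wt": "json",
        })
        for doc in result["response"]["docs"]:
            doc.pop("_version_", None)  # let the target assign its own version
            yield doc
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            return
        cursor = next_cursor


def reindex(fetch=http_fetch, send=http_send, batch_size=1000):
    """Copy all documents from the store to the target; return the count."""
    batch, total = [], 0
    for doc in iter_stored_docs(fetch, rows=batch_size):
        batch.append(doc)
        if len(batch) >= batch_size:
            send(batch)
            total += len(batch)
            batch = []
    if batch:  # send whatever is left over
        send(batch)
        total += len(batch)
    return total
```

Calling reindex() with no arguments uses the two HTTP helpers. The sketch
never sends a commit, so the new documents become visible only after the
target's autoCommit/autoSoftCommit settings kick in or you commit explicitly.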
Indexing speed is another reason to avoid reindexes. Indexing hundreds
of millions of documents (or more) is going to take a while even when
indexing speed is highly optimized.
Here is that wiki page that I wrote quite a while ago:
https://cwiki.apache.org/confluence/display/SOLR/HowToReindex
Thanks,
Shawn