On 10/17/2020 6:23 AM, Vinay Rajput wrote:
That said, one more time I want to come back to the same question: why
can Solr/Lucene not handle this when we are updating all the documents?
Let's take a couple of examples:

*Ex 1:*
Let's say I have only 10 documents in my index and all of them are in a
single segment (Segment 1). Now, I change the schema (update field type in
this case) and reindex all of them.
This is what (according to me) should happen internally :-

1st update req: Solr will mark the 1st doc as deleted and index it again
(might run the analyser chain based on config)
2nd update req: Solr will mark the 2nd doc as deleted and index it again
(might run the analyser chain based on config)
And so on...
Based on the autoSoftCommit/autoCommit configuration, all new documents
will be indexed and probably flushed to disk as part of a new segment
(Segment 2). Roughly, my update loop looks like the sketch below.
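Just to make that flow concrete, a rough SolrJ sketch of the reindexing
client (collection name and field names are made up, commits left to
autoCommit/autoSoftCommit unless you call commit() yourself):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReindexAll {
    public static void main(String[] args) throws Exception {
        // Hypothetical core/collection name; adjust the URL for your setup.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            for (int i = 1; i <= 10; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                // Same id as before, so the old version is marked deleted
                doc.addField("id", Integer.toString(i));
                // Field whose type changed in the schema; indexed with the NEW analysis chain
                doc.addField("title_txt", "document " + i);
                client.add(doc);
            }
            // Or rely on autoCommit/autoSoftCommit instead of an explicit commit
            client.commit();
        }
    }
}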

<snip>

*Ex 2:*
I see that it can be an issue if we think about reindexing millions of
docs, because in that case merging can be triggered when indexing is only
halfway through, and since there are some live docs in the old segments
(with the old config), things will blow up. Please correct me if I am wrong.

If you could guarantee a few things, you could be sure this will work. But it's a serious long shot.

The change in schema might be such that when Lucene tries to merge them, it fails because the data in the old segments is incompatible with the new segments. If that happens, then you're sunk ... it won't work at all.

If the merges of old and new segments are successful, then you would have to optimize the index after you're done indexing to be SURE there were no old documents remaining. Lucene calls that operation "ForceMerge". This operation is disruptive and can take a very long time.
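If you wanted to trigger that from a client, a SolrJ call along these lines
would do it. The collection name is hypothetical, and optimize() is Solr's
name for Lucene's forceMerge:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ForceMergeAll {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            // waitFlush=true, waitSearcher=true, merge down to a single segment.
            // Expect this to take a very long time on a large index.
            client.optimize(true, true, 1);
        }
    }
}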

You would also have to be sure there was no query activity until the update/merge is completely done. Which probably means that you'd want to work on a copy of the index in another collection. And if you're going to do that, you might as well start indexing from scratch into a new/empty collection. That would also allow you to continue querying the old collection until the new one was ready.
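For the new-collection approach, the usual trick in SolrCloud is to index
into a fresh collection and then repoint an alias at it once it is ready,
so queries never see the half-built index. A sketch, with made-up names
("products" is the alias your queries use, "products_v2" is the newly
reindexed collection):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwapAlias {
    public static void main(String[] args) throws Exception {
        // Point at the Solr base URL (not a specific collection) for admin requests.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr").build()) {
            // Atomically repoint the "products" alias at the new collection.
            CollectionAdminRequest.createAlias("products", "products_v2")
                                  .process(client);
        }
    }
}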

Thanks,
Shawn
