Sorry to jump into this discussion. I also get confused whenever I see this
strange Solr/Lucene behaviour. Probably, as @Erick said in his talk last
year, this is how it was designed, to avoid many problems that are
hard/impossible to solve.

That said, I want to come back to the same question one more time: why
can't Solr/Lucene handle this when we are updating all the documents?
Let's take a couple of examples:

*Ex 1:*
Let's say I have only 10 documents in my index and all of them are in a
single segment (Segment 1). Now, I change the schema (update a field type in
this case) and reindex all of them.
This is what (as I understand it) should happen internally:

1st update req: Solr will mark the 1st doc as deleted and index it again
(might run the analyser chain based on config)
2nd update req: Solr will mark the 2nd doc as deleted and index it again
(might run the analyser chain based on config)
And so on...
Based on the autoSoftCommit/autoCommit configuration, all new documents will
be indexed and probably flushed to disk as part of a new segment (Segment 2).


Now, whenever segment merging happens (during commit or later in time),
Lucene will create a new segment (Segment 3) and can discard all the docs
present in Segment 1, as there are no live docs in it. And there would *NOT*
be any situation where it has to choose between the old config and the new
config, as there is not even a single live document with the old config.
Isn't that so?
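
To make Ex 1 concrete, here is a toy Python model of what I am describing.
It is only a sketch of my mental picture, not Lucene's actual data
structures: the Segment class and the update/merge functions are names I
made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for a segment (hypothetical, not a Lucene class)."""
    name: str
    docs: dict = field(default_factory=dict)    # doc id -> config it was indexed with
    deleted: set = field(default_factory=set)   # ids marked deleted, not yet removed

    def live_docs(self):
        return {d: cfg for d, cfg in self.docs.items() if d not in self.deleted}


def update(doc_id, config, old_segments, new_segment):
    # An update is modelled as "mark the old copy deleted, then add a new copy".
    for seg in old_segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)
    new_segment.docs[doc_id] = config


def merge(name, segments):
    # A merge copies only the live docs into a fresh segment.
    merged = Segment(name)
    for seg in segments:
        merged.docs.update(seg.live_docs())
    return merged


seg1 = Segment("Segment 1", {i: "old" for i in range(10)})
seg2 = Segment("Segment 2")

# Reindex all 10 documents after the schema change.
for i in range(10):
    update(i, "new", [seg1], seg2)

seg3 = merge("Segment 3", [seg1, seg2])
print(seg1.live_docs())          # {}
print(set(seg3.docs.values()))   # {'new'}
```

In this model, Segment 1 has no live docs left by the time the merge runs,
so Segment 3 only ever sees docs indexed with the new config.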

*Ex 2:*
I see that it can be an issue if we think about reindexing millions of
docs. In that case, merging can be triggered when indexing is halfway
through, and since there are still some live docs in the old segment (with
the old config), things will blow up. Please correct me if I am wrong.
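
A toy sketch of Ex 2, scaled down to 10 docs, with a merge triggered after
only half of them have been reindexed. Again, this is just my assumption of
how it plays out, not real Lucene code: segments are plain dicts, and the
reindex/merge helpers are hypothetical names.

```python
def live(segment):
    # segment maps doc id -> (config used at index time, is_live flag)
    return {doc_id: cfg for doc_id, (cfg, alive) in segment.items() if alive}


def reindex(doc_id, old_segment, new_segment):
    # Mark the old copy as deleted and add a copy built with the new config.
    cfg, _ = old_segment[doc_id]
    old_segment[doc_id] = (cfg, False)
    new_segment[doc_id] = ("new", True)


def merge(*segments):
    # Only live docs are carried over into the merged segment.
    merged = {}
    for seg in segments:
        for doc_id, cfg in live(seg).items():
            merged[doc_id] = (cfg, True)
    return merged


old_segment = {i: ("old", True) for i in range(10)}
new_segment = {}

# Only the first half has been reindexed when the merge kicks in.
for i in range(5):
    reindex(i, old_segment, new_segment)

merged = merge(old_segment, new_segment)
configs_in_merged = {cfg for cfg, _ in merged.values()}
print(sorted(configs_in_merged))   # ['new', 'old']
```

Here the merged segment ends up holding docs written under both configs,
which is the situation I suspect cannot be handled. Whether it actually
fails would depend on the kind of schema change.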

I am *NOT* a Solr/Lucene expert and have just started learning how things
work internally. In the above examples, I may be wrong in many places. Can
someone confirm whether scenarios like Ex 2 are the reason why even
re-indexing all documents doesn't help if some incompatible schema changes
are made? Any other insight would also be helpful.

Thanks,
Vinay

On Sat, Oct 17, 2020 at 5:48 AM Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/16/2020 2:36 PM, David Hastings wrote:
> > sorry, i was thinking just using the
> > <delete><query>*:*</query></delete>
> > method for clearing the index would leave them still
>
> In theory, if you delete all documents at the Solr level, Lucene will
> delete all the segment files on the next commit, because they are empty.
>   I have not confirmed with testing whether this actually happens.
>
> It is far safer to use a new index as Erick has said, or to delete the
> index directories completely and restart Solr ... so you KNOW the index
> has nothing in it.
>
> Thanks,
> Shawn
>
