I can tell you how we did it, since we have a similar problem. In our case, the indexer is a crawler which fetches the data from the REST API of our website. It is a program that comes loaded with all the knowledge of how the data is organised on the website as well as in the Solr index.

Handling this at the application level gives you finer control over everything. The crawler could slurp the filenames of the documents that are already in the index into its cache before fetching the source; while crawling, it should then compare the documents from the source with the documents in its own cache. No need to struggle with the configuration of Solr for that.
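To make that a bit more concrete, here is a stripped-down sketch of the idea in Python. This is not our actual crawler; the collection URL, the source REST endpoint and the field mapping are placeholders you would have to replace with your own:

```python
import requests

SOLR = "http://localhost:8983/solr/documents"          # placeholder collection URL
SOURCE_API = "https://www.example.org/api/documents"   # placeholder website REST API


def load_indexed_ids():
    """Slurp the unique keys of everything already in the index (the crawler's cache)."""
    ids, cursor = set(), "*"
    while True:
        resp = requests.get(f"{SOLR}/select", params={
            "q": "*:*",
            "fl": "DocID",           # your uniqueKey field
            "rows": 1000,
            "sort": "DocID asc",     # cursorMark paging needs a sort on the unique key
            "cursorMark": cursor,
            "wt": "json",
        }).json()
        ids.update(doc["DocID"] for doc in resp["response"]["docs"])
        if resp["nextCursorMark"] == cursor:   # no more pages
            return ids
        cursor = resp["nextCursorMark"]


def crawl():
    indexed = load_indexed_ids()
    for doc in requests.get(SOURCE_API).json():   # whatever your source API returns
        doc_id = str(doc["id"])                   # placeholder source field
        if doc_id in indexed:
            continue   # already indexed; compare timestamps here if you also need updates
        requests.post(f"{SOLR}/update/json/docs",
                      json={"DocID": doc_id,
                            "type_level": "parent",
                            "DocNameS": doc.get("name", "")})  # map the rest of your fields here
    requests.post(f"{SOLR}/update", json={"commit": {}})       # commit once at the end


if __name__ == "__main__":
    crawl()
```

The point is simply that the application decides whether a document still needs to be (re)indexed, so Solr never sees two copies of the same record. For cleaning up the duplicates Sergio describes further down (the old documents without type_level), see the delete-by-query note at the very end of this mail.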
Mag.phil. Robert Ehrenleitner, BEng.
Paris-Lodron-University of Salzburg

________________________________
From: Mikhail Khludnev <m...@apache.org>
Sent: Monday, 19 May 2025 15:04
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Using ExtractRequest handler to index documents using type_leve=parent

Hello Sergio,
I don't think that adding nested docs into indexes with standalone docs was
ever supported or considered. Right. Please check the bold text here:
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html#schema-configuration

On Mon, May 19, 2025 at 3:29 PM Sergio García Maroto <marot...@gmail.com>
wrote:

> Thanks for your quick response. Let me elaborate a bit more.
> There is an existing index I have in production where I reindex documents
> all the time using the DocID unique field.
> When a new request comes in using /update/extract, that document gets
> reindexed and replaced with new data. No problem with that.
>
> Once I added the _root_ and type_level fields to the schema, the old
> existing document in Solr stays the same and a new document gets created.
> If I reindex the same document again, the new one gets reindexed and
> rewritten, but the old one is still there.
>
> I have the feeling there is an issue with /update/extract the first time
> you add _root_ and type_level to the schema. It doesn't understand that
> the old document and the new one are the same.
>
> This forces me to delete the index and do a reindexation from scratch, or
> reindex all documents and at the end delete the old ones that don't have
> type_level:parent.
>
> Any ideas on this?
>
> On Mon 19 May 2025 at 13:40, Ehrenleitner Robert Harald <
> robert.ehrenleit...@plus.ac.at> wrote:
>
> > Hi,
> >
> > what exactly do you mean by "my document appears twice"? A document can
> > appear a hundred times if all the entries differ only by the ID. Make
> > sure your indexer takes care of this. Also, your unique ID field is
> > "DocID", and according to your sample, its value seems to match ID.
> > Make sure it always matches, otherwise it is handled like a compound
> > primary key in a SQL database (actually, Solr's DB is a NoSQL database,
> > but this only concerns the way the data is queried).
> >
> > Also, make sure your query does not confuse parent ID and ID in some
> > way. This could happen due to a bug in the querying application.
> >
> > Mag.phil. Robert Ehrenleitner, BEng.
> > --
> >
> > Mag.phil. Robert Ehrenleitner, BEng.
> >
> > Web-Developer
> >
> > IT-Services | Application & Digitalization Services
> >
> > Hellbrunner Straße 34 | 5020 Salzburg | Austria
> >
> > Tel.: +43/(0)662/8044 - 6778
> >
> > *www.plus.ac.at <http://www.plus.ac.at/>*
> >
> > ------------------------------
> > *From:* Sergio García Maroto <marot...@gmail.com>
> > *Sent:* Monday, 19 May 2025 13:00
> > *To:* solr-user <solr-u...@lucene.apache.org>
> > *Subject:* Using ExtractRequest handler to index documents using type_leve=parent
> >
> > Hi,
> >
> > I have been indexing documents for a long time using /update/extract.
> > Everything has been working well until I got a new requirement to add
> > nested documents.
> >
> > I added to schema.xml:
> > <field name="type_level" type="string" indexed="true" stored="true"
> > docValues="true" />
> > <field name="_root_" type="string" indexed="true" stored="true"
> > multiValued="false" required="false" />
> >
> > My unique field:
> > <field name="DocID" type="string" indexed="true" stored="true" />
> > <uniqueKey>DocID</uniqueKey>
> >
> > After doing this, my request to /update/extract to reindex the same
> > document duplicates the document in Solr.
> > Here is my request. The only thing I changed is the new parameter
> > type_level:parent.
> >
> > http://server:8983/solr/document/update/extract?
> > literal.id=6584239&
> > resource.name=&
> > wt=xml&
> > literal.DocID=6584239&
> > literal.CoreID=6584239&
> > literal.DocIsAttachToPNB=False&
> > literal.DocAuthorID=1455&
> > literal.DocIsAttachToPerson=True&
> > literal.DocIsAttachToAssign=False&
> > literal.DocIsAttachToCompany=False&
> > literal.DocVersionID=4504527&
> > literal.InsertDateSD=2011-01-03T07%3a51%3a00.0Z&
> > literal.DocNameS=Squires+David+RES.doc&
> > literal.DocCateNameS=Resume%2fCV&
> > literal.DocAreaCateNameS=Person+Module&
> > literal.type_level=parent&
> > stream.url=http%3a%2f%2flocalhost%3a8081%2f4%2f50%2f45%2fSquires%2520David%2520RES15EAC416-AF05-4D38-A4F9-7B489962C167.docx&
> > overwrite=true&
> > commit=true
> >
> > After this request the document appears duplicated. The only difference
> > between the old one and the new one is type_level:parent.
> >
> > Does anyone have any idea why this is happening?
> >
> > Regards,
> > Sergio Maroto
>

--
Sincerely yours
Mikhail Khludnev
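PS regarding the cleanup Sergio describes further up (reindex everything, then remove the old copies that were indexed before _root_ and type_level existed): instead of deleting the whole index, a delete-by-query after the full reindex should be enough. A minimal sketch, assuming the core name from your request and that every document you actually want to keep carries a type_level value:

```python
import requests

SOLR = "http://server:8983/solr/document"   # the core from your /update/extract request

# Remove every document that has no type_level field at all, i.e. the stale
# copies indexed before _root_ and type_level were added to the schema.
# Assumption: parents (and any future children) always get a type_level value.
resp = requests.post(
    f"{SOLR}/update",
    params={"commit": "true"},
    json={"delete": {"query": "*:* -type_level:[* TO *]"}},
)
resp.raise_for_status()
print(resp.json())
```

As always with delete-by-query on an index that is meant to hold nested documents, run it only after the reindex has finished, and try it on a test core first.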