You can check the Hadoop job's counters to see how many documents are being deleted. If some are, then -deleteGone is enabled in your case. Only with that setting are documents deleted by the indexing job.
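For illustration, here is a minimal sketch of the two invocations (the crawldb, linkdb, and segment paths are placeholders for your own crawl layout):

```shell
# Indexing WITHOUT -deleteGone: pages that have disappeared from the
# web stay in the Solr index (what you want for your ML use case).
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

# Indexing WITH -deleteGone: pages fetched as gone (e.g. 404) or
# redirects are removed from the index during the indexing job itself,
# independently of any cleaningJob run.
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -deleteGone
```

So if your indexing script passes -deleteGone, dropping that flag (and not running the cleaningJob) should keep gone documents in the index.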
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Monday 2nd October 2017 21:51
> To: User <[email protected]>
> Subject: deletions from index
>
> With my new news crawl, I would like to keep web pages in the index, even
> after they have disappeared from the web, so I can continue using them in
> machine-learning processes. I thought I could achieve this by avoiding
> running cleaning jobs. However, I still notice increasing numbers of
> deletions in my Solr index.
>
> When and why does Nutch tell the indexer to delete documents, other than
> during cleaningJob?
>
> For example, recently, Solr tells me that numDocs is about 189,000 and
> deletedDocs is about 96,000. Even if I assume that some of the "deleted"
> docs have just been replaced by newer content, I am not ready to believe
> that has happened to so many of them.
>
> Should I use a different indexer, or different settings, or something
> other than an indexer for this purpose?

