If you don't delete documents, the difference between numDocs and maxDoc is just updated documents: each update leaves the older version in the index, marked as deleted and eligible for removal.
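To make that concrete, here is a rough sketch using the numbers quoted below. It assumes that nothing was explicitly deleted (no -deleteGone, no cleaning job) and that Lucene drops already-deleted documents when it merges segments. The function and variable names are only illustrative, they are not part of Nutch or Solr.

```python
# Rough bookkeeping for the Solr admin numbers quoted in this thread.
# Assumption: no explicit deletes were sent, so every "deleted" doc is the
# stale version of a document that was re-indexed.

def deleted_docs(num_docs: int, max_doc: int) -> int:
    """Deleted docs as Solr reports them: maxDoc minus numDocs."""
    return max_doc - num_docs

# Numbers before and after the crawl/index run, taken from the quoted mail.
before_num, before_max = 189550, 285531
after_num, after_max = 190785, 223339
indexed = 13423  # "indexed (add/update)" reported by the indexing job

# Genuinely new documents: the growth in live documents.
new_docs = after_num - before_num

# The rest of the indexed documents replaced an existing version.
replaced = indexed - new_docs

# Every add/update appends one document, so maxDoc would have grown by
# `indexed` if no segment merge had reclaimed old versions.
purged = before_max + indexed - after_max

print("deleted before:", deleted_docs(before_num, before_max))
print("deleted after: ", deleted_docs(after_num, after_max))
print("new documents: ", new_docs)
print("replaced docs: ", replaced)
print("purged by merges (under the assumptions above):", purged)
```

Under those assumptions the drop in deleted docs comes from merges reclaiming old versions, not from Nutch deleting anything. The Hadoop job counters remain the way to confirm that no deletes were actually sent.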

-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Monday 2nd October 2017 23:29
> To: [email protected]
> Subject: Re: deletions from index
>
> So, I had these numbers in my index:
> Num Docs: 189550
> Max Docs: 285531
> Deleted Docs: 95981
>
> Then I did a crawl and index, which told me
> indexed (add/update): 13,423
> And now I have these numbers in my index:
>
> Num Docs: 190785
> Max Docs: 223339
> Deleted Docs: 32554
>
> So, I am completely confused. I don't use "-deleteGone" but I get massive
> numbers of deletions.
>
> Is it your theory that Solr's report of deleted docs really just means that
> docs were replaced by newer content?
>
>
> From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>; User <[email protected]>
> Sent: Monday, October 2, 2017 1:19 PM
> Subject: RE: deletions from index
>
> You can check the Hadoop job's counters to see how many are being deleted. If
> some are, then -deleteGone is on in your case. Only with that setting
> documents are going to be deleted.
>
> > -----Original message-----
> > From:Michael Coffey <[email protected]>
> > Sent: Monday 2nd October 2017 21:51
> > To: User <[email protected]>
> > Subject: deletions from index
> >
> > With my new news crawl, I would like to keep web pages in the index, even
> > after they have disappeared from the web, so I can continue using them in
> > machine-learning processes. I thought I could achieve this by avoiding
> > running cleaning jobs. However, I still notice increasing numbers of
> > deletions in my solr index.
> >
> > When and why does nutch tell the indexer to delete documents, other than
> > during cleaningJob?
> >
> > For example, recently, Solr tells me that numDocs is about 189,000 and
> > deletedDocs is about 96,000. Even if I assume that some of the "deleted"
> > docs have just been replaced by newer content, I am not ready to believe
> > that has happened to so many of them.
> >
> > Should I use a different indexer, or different settings, or something other
> > than an indexer for this purpose?

