So, I had these numbers in my index:
Num Docs: 189550
Max Docs: 285531
Deleted Docs: 95981

Then I did a crawl and index, which told me:
indexed (add/update): 13,423

And now I have these numbers in my index:

Num Docs: 190785
Max Docs: 223339
Deleted Docs: 32554

So, I am completely confused. I don't use "-deleteGone" but I get massive 
numbers of deletions.

Is it your theory that Solr's report of deleted docs really just means that 
docs were replaced by newer content?
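
A quick arithmetic check on the figures above, as a minimal sketch. It relies on two things that hold for Lucene-based indexes in general: Max Docs is Num Docs plus Deleted Docs, and re-indexing an existing document marks the old copy as deleted until a segment merge purges it. The estimate of replaced documents additionally assumes that the indexing job sent no explicit delete requests, which is an assumption and not something the numbers prove. The variable names are illustrative only.

```python
# Index statistics quoted in this message, before and after the crawl.
before = {"num_docs": 189550, "max_docs": 285531, "deleted_docs": 95981}
after = {"num_docs": 190785, "max_docs": 223339, "deleted_docs": 32554}
indexed = 13423  # "indexed (add/update)" as reported by the indexing job


def is_consistent(stats):
    """Max Docs should equal live docs plus docs marked as deleted."""
    return stats["max_docs"] == stats["num_docs"] + stats["deleted_docs"]


# Net growth in live documents across the crawl.
net_new = after["num_docs"] - before["num_docs"]

# ASSUMPTION: no explicit deletes were sent, so every indexed document that
# did not add to the live count replaced an older copy of the same document.
replaced = indexed - net_new

print("before consistent:", is_consistent(before))
print("after consistent:", is_consistent(after))
print("net new live docs:", net_new)
print("estimated replaced docs:", replaced)
```

Under that assumption, most of the 13,423 indexed documents were updates of existing ones. The drop in Deleted Docs between the two snapshots would then be consistent with a segment merge purging previously marked copies, not with documents being restored.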


      From: Markus Jelsma <[email protected]>
 To: "[email protected]" <[email protected]>; User 
<[email protected]> 
 Sent: Monday, October 2, 2017 1:19 PM
 Subject: RE: deletions from index
   
You can check the Hadoop job's counters to see how many are being deleted. If 
some are, then -deleteGone is on in your case. Documents are only deleted when 
that setting is on.

 
 
-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Monday 2nd October 2017 21:51
> To: User <[email protected]>
> Subject: deletions from index
> 
> With my new news crawl, I would like to keep web pages in the index, even 
> after they have disappeared from the web, so I can continue using them in 
> machine-learning processes. I thought I could achieve this by avoiding 
> running cleaning jobs. However, I still notice increasing numbers of 
> deletions in my Solr index.
> When and why does Nutch tell the indexer to delete documents, other than 
> during cleaningJob?
> For example, recently, Solr tells me that numDocs is about 189,000 and 
> deletedDocs is about 96,000. Even if I assume that some of the "deleted" docs 
> have just been replaced by newer content, I am not ready to believe that has 
> happened to so many of them.
> Should I use a different indexer, or different settings, or something other 
> than an indexer for this purpose?
> 

   
