So, I had these numbers in my index:
Num Docs: 189550
Max Docs: 285531
Deleted Docs: 95981
Then I did a crawl and index, which told me:
indexed (add/update): 13,423
And now I have these numbers in my index:
Num Docs: 190785
Max Docs: 223339
Deleted Docs: 32554
So, I am completely confused. I don't use "-deleteGone" but I get massive
numbers of deletions.
Is it your theory that Solr's report of deleted docs really just means that
docs were replaced by newer content?
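
To make the question concrete, here is a rough sketch of the arithmetic behind that theory, using only the figures quoted above. It assumes that the indexing job sent no explicit deletes and that Solr's Deleted Docs is simply Max Docs minus Num Docs; the names `before`, `after`, `indexed`, and `consistent` are made up for illustration.

```python
# Figures copied from the Solr statistics before and after the crawl.
before = {"numDocs": 189550, "maxDoc": 285531, "deletedDocs": 95981}
after = {"numDocs": 190785, "maxDoc": 223339, "deletedDocs": 32554}
indexed = 13423  # "indexed (add/update)" as reported by the indexing job


def consistent(stats):
    """Check that deletedDocs equals maxDoc minus numDocs."""
    return stats["maxDoc"] - stats["numDocs"] == stats["deletedDocs"]


# Net growth in live documents between the two snapshots.
net_new = after["numDocs"] - before["numDocs"]

# Assumption: no real deletes were sent, so every indexed document that
# did not add to numDocs must have overwritten an existing document.
replaced = indexed - net_new

print("before consistent:", consistent(before))
print("after consistent:", consistent(after))
print("net new documents:", net_new)
print("documents replaced by this run:", replaced)
```

Under those assumptions, most of the 13,423 indexed documents would be updates to pages already in the index, each leaving one old copy marked as deleted until Solr merges segments. The drop in Deleted Docs between the two snapshots would also fit a segment merge having purged old copies, though the numbers alone cannot prove that.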
From: Markus Jelsma <[email protected]>
To: "[email protected]" <[email protected]>; User
<[email protected]>
Sent: Monday, October 2, 2017 1:19 PM
Subject: RE: deletions from index
You can check the Hadoop job's counters to see how many are being deleted. If
some are, then -deleteGone is on in your case. Documents are deleted only when
that setting is enabled.
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Monday 2nd October 2017 21:51
> To: User <[email protected]>
> Subject: deletions from index
>
> With my new news crawl, I would like to keep web pages in the index, even
> after they have disappeared from the web, so I can continue using them in
> machine-learning processes. I thought I could achieve this by avoiding
> running cleaning jobs. However, I still notice increasing numbers of
> deletions in my Solr index.
> When and why does Nutch tell the indexer to delete documents, other than
> during cleaningJob?
> For example, recently, Solr tells me that numDocs is about 189,000 and
> deletedDocs is about 96,000. Even if I assume that some of the "deleted" docs
> have just been replaced by newer content, I am not ready to believe that has
> happened to so many of them.
> Should I use a different indexer, or different settings, or something other
> than an indexer for this purpose?
>