With my new news crawl, I would like to keep web pages in the index even after
they have disappeared from the web, so I can continue using them in
machine-learning processes. I thought I could achieve this by not running any
cleaning jobs. However, I still see an increasing number of deletions in my
Solr index.
When and why does Nutch tell the indexer to delete documents, other than during
the CleaningJob?
For example, Solr recently reported numDocs at about 189,000 and deletedDocs at
about 96,000. Even if some of the "deleted" docs were simply replaced by newer
versions of the same URL, I find it hard to believe that accounts for so many
of them.
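For reference, this is roughly how I read those counts, using Solr's CoreAdmin STATUS endpoint (the core name `nutch` and the host/port here are placeholders for my setup, not anything Nutch configures):

```python
# Sketch: read numDocs/deletedDocs from Solr's CoreAdmin STATUS response.
# The core name and base URL are assumptions about my local setup.
import json
from urllib.request import urlopen


def index_counts(status: dict, core: str) -> tuple[int, int]:
    """Extract (numDocs, deletedDocs) for one core from a STATUS response."""
    index = status["status"][core]["index"]
    return index["numDocs"], index["deletedDocs"]


def fetch_index_counts(core: str, base: str = "http://localhost:8983/solr") -> tuple[int, int]:
    """Query the CoreAdmin STATUS action and return the two counters."""
    url = f"{base}/admin/cores?action=STATUS&core={core}&wt=json"
    with urlopen(url) as resp:
        return index_counts(json.load(resp), core)


# Usage (against a running Solr):
#   num_docs, deleted = fetch_index_counts("nutch")
#   print(f"numDocs={num_docs} deletedDocs={deleted}")
```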
Should I use a different indexer, different settings, or something other than
an indexer altogether for this purpose?