With my new news crawl, I would like to keep web pages in the index even after 
they have disappeared from the web, so that I can continue using them in 
machine-learning processes. I thought I could achieve this by simply not 
running the cleaning job. However, I still see an increasing number of 
deletions in my Solr index.
When and why does Nutch tell the indexer to delete documents, other than during 
the CleaningJob?
For example, Solr currently reports numDocs at about 189,000 and deletedDocs at 
about 96,000. Even if I assume that some of the "deleted" docs were simply 
replaced by newer content, I find it hard to believe that has happened to so 
many of them.
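(For context, those figures can be read from Solr's Luke request handler; below is a 
minimal sketch of such a check, assuming a local Solr instance and a core named 
"nutch" — both are placeholders for whatever your setup actually uses.)

    import json
    from urllib.request import urlopen

    # Placeholder Solr location and core name -- adjust to your own setup.
    SOLR_URL = "http://localhost:8983/solr"
    CORE = "nutch"

    # The Luke handler exposes low-level index statistics, including documents
    # that are flagged as deleted but not yet purged by a segment merge.
    url = f"{SOLR_URL}/{CORE}/admin/luke?numTerms=0&wt=json"
    with urlopen(url) as resp:
        stats = json.load(resp)["index"]

    print("numDocs     :", stats["numDocs"])      # live (searchable) documents
    print("deletedDocs :", stats["deletedDocs"])  # deleted, awaiting merge
    print("maxDoc      :", stats["maxDoc"])       # numDocs + deletedDocs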
Should I use a different indexer, or different settings, or something other 
than an indexer for this purpose?
