Thank you, Upayavira, for your answer. In the case I described, maxDoc is 19263. I checked Nutch: its default indexing filter is the basic indexing filter, and it also has a property for deleting gone and permanently redirected pages, which was set to false in my setup. So I think the problem still remains on the Solr side.
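To make the arithmetic behind that concrete, here is a minimal sketch in Python. It assumes numDocs is the 19250 from my original mail and belongs to the same crawl as the maxDoc above, and it only uses the 23343 fetched pages (I am not sure whether the 1146 images are counted inside that figure). The function names are hypothetical, and the sample dictionary only mimics the "index" section that Solr's Luke handler (/admin/luke?numTerms=0&wt=json) returns. Nothing is fetched over the network.

```python
def counts_from_luke(luke_response):
    """Extract (numDocs, maxDoc) from a dict shaped like a Luke handler JSON response."""
    index = luke_response["index"]
    return index["numDocs"], index["maxDoc"]


def index_gap(fetched, num_docs, max_doc):
    """Split the fetched-vs-indexed gap into overwritten and unaccounted documents.

    maxDoc - numDocs counts deleted (e.g. overwritten) documents that have not
    been merged away yet, so it is only a rough estimate of ID collisions.
    """
    overwritten = max_doc - num_docs
    missing = fetched - num_docs
    return {
        "overwritten": overwritten,
        "missing": missing,
        "unaccounted": missing - overwritten,
    }


# Hand-written sample using the numbers from this thread (not real Solr output).
sample = {"index": {"numDocs": 19250, "maxDoc": 19263}}
num_docs, max_doc = counts_from_luke(sample)

# 23343 is the fetched page count; images are deliberately left out (assumption).
result = index_gap(fetched=23343, num_docs=num_docs, max_doc=max_doc)
print(result)
```

If these assumptions hold, overwritten documents explain only a small part of the difference, which is why I think duplicate IDs are not the main cause.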

On Mon, Sep 28, 2015 at 3:03 PM, Upayavira <u...@odoko.co.uk> wrote:

> I suspect you may be better off asking this on the Nutch user list. The
> decisions you are describing will be within the Nutch codebase, not
> Solr. Someone here may know (hopefully) but you may get more support
> over on the Nutch list.
>
> One suggestion - start with a clean, empty index. Run a crawl. Look at
> the maxDocs vs numDocs (visible via the admin UI for your
> core/collection). If maxDocs>numDocs, it means that some docs have been
> overwritten - i.e. the ID field that Nutch is using is not unique.
>
> Upayavira
>
> On Mon, Sep 28, 2015, at 10:19 AM, Daniel Holmes wrote:
> > Hi,
> > I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing.
> > In my tests there is a gap between the number of results fetched by
> > Nutch and the number of documents indexed in Solr. For example, one of
> > the crawls fetched 23343 pages and 1146 images successfully, while in
> > Solr 19250 docs are indexed and 500 of them are image URLs.
> >
> > My question is: what kind of pages are indexed in Solr, and why?
> > Does Solr index pages with other statuses or not?
> > What kind of images does Solr index?
> >
> > Thanks.