Re: What kind of nutch documents does Solr index?

Daniel Holmes Wed, 30 Sep 2015 05:08:24 -0700

Thank you, Upayavira, for your answer. In the case I described, maxDoc is
19263. I checked Nutch: its default indexing filter is the basic indexing
filter, and it also has a property to delete gone and permanently
redirected pages, whose value was false for me.
So I think the problem still remains on the Solr side.
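
To put the figures from this thread side by side, here is a rough sketch in
Python. It assumes numDocs is the 19250 reported in the first post and that
the index started out empty with no deletes expunged by segment merges, so
maxDoc minus numDocs approximates the overwritten documents. The helper
names and the core URL are hypothetical; the Luke request handler
(/admin/luke) is the standard Solr endpoint that reports numDocs and maxDoc.

```python
import json
from urllib.request import urlopen


def fetch_luke_index(core_url):
    """Fetch index statistics (numDocs, maxDoc, ...) from Solr's Luke handler.

    core_url is a hypothetical base URL such as
    "http://localhost:8983/solr/collection1".
    """
    with urlopen(core_url + "/admin/luke?numTerms=0&wt=json") as resp:
        return json.load(resp)["index"]


def explain_gap(fetched, num_docs, max_doc):
    """Split the fetched-vs-indexed gap into two rough parts."""
    overwritten = max_doc - num_docs   # documents deleted or replaced (same ID)
    never_indexed = fetched - max_doc  # documents that never reached the index
    return {"overwritten": overwritten, "never_indexed": never_indexed}


# Figures from this thread; num_docs=19250 is assumed from the first post.
gap = explain_gap(fetched=23343 + 1146, num_docs=19250, max_doc=19263)
print(gap)  # {'overwritten': 13, 'never_indexed': 5226}
```

Under these assumptions, overwriting accounts for only 13 documents, so most
of the gap comes from documents that were never sent to or accepted by Solr.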



On Mon, Sep 28, 2015 at 3:03 PM, Upayavira <u...@odoko.co.uk> wrote:

> I suspect you may be better off asking this on the Nutch user list. The
> decisions you are describing will be within the Nutch codebase, not
> Solr. Someone here may know (hopefully) but you may get more support
> over on the Nutch list.
>
> One suggestion: start with a clean, empty index. Run a crawl. Look at
> maxDoc vs numDocs (visible via the admin UI for your
> core/collection). If maxDoc > numDocs, it means that some docs have been
> overwritten, i.e. the ID field that Nutch is using is not unique.
>
> Upayavira
>
> On Mon, Sep 28, 2015, at 10:19 AM, Daniel Holmes wrote:
> > Hi,
> > I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing.
> > In my tests there is a gap between the number of results fetched by
> > Nutch and the number of documents indexed in Solr. For example, one of
> > the crawls fetched 23343 pages and 1146 images successfully, while in
> > Solr 19250 docs are indexed and 500 of them are image URLs.
> >
> > My questions are: what kind of pages are indexed in Solr, and why?
> > Does Solr index pages with other statuses or not?
> > What kind of images does Solr index?
> >
> > Thanks.
>
