The Solr schema and mappings all seem to work fine.  It's just that
sometimes I run solrindex, no documents get added to the Solr index, and I
have no indication of why.  I see my fetcher grabbing thousands of pages,
yet my doc count on Solr doesn't increase.
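Not a fix for the root cause, but one way I've been removing the guesswork (a sketch, assuming the same http://127.0.0.1:8983/solr/ URL from the tutorial) is to query Solr directly for its document count before and after a solrindex run:

```shell
# Count the documents Solr currently holds (assumption: the default
# http://127.0.0.1:8983/solr/ URL from the tutorial; adjust host/port):
#
#   curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0'
#
# The total comes back in the numFound attribute; with a canned response
# standing in for the live one, it can be pulled out like this:
response='<result name="response" numFound="2000" start="0"/>'
count=$(printf '%s' "$response" | sed -n 's/.*numFound="\([0-9]*\)".*/\1/p')
echo "numFound: $count"    # prints: numFound: 2000
```

If the number only moves after an explicit commit (`curl 'http://127.0.0.1:8983/solr/update?commit=true'`), the documents are arriving but not being committed.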

I've cleared my index and have been following the steps here:
http://wiki.apache.org/nutch/RunningNutchAndSolr and it seems to be working
better.  I'm just not sure why these steps work better when the Nutch
tutorial steps I followed before didn't.  The only difference I can see is
the -noParse fetch and the separate parse step.
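For reference, the fetch-without-parse variant from that wiki page looks roughly like this (an assumption on my part: Nutch 1.x, where the fetcher flag is spelled -noParsing rather than -noParse):

```shell
# Fetch-without-parse cycle from the RunningNutchAndSolr page (assumption:
# Nutch 1.x, where the fetcher flag is spelled -noParsing):
#
#   s1=`ls -d crawl/segments/2* | tail -1`
#   bin/nutch fetch $s1 -noParsing    # fetch only; no parsing during fetch
#   bin/nutch parse $s1               # parse the segment as a separate step
#   bin/nutch updatedb crawl/crawldb $s1
#
# Segment names are timestamps, so `ls ... | tail -1` always picks the
# newest one; a quick sanity check with dummy directories:
mkdir -p demo/segments/20100730120000 demo/segments/20100731120000
s1=`ls -d demo/segments/2* | tail -1`
echo "$s1"    # the newest segment: demo/segments/20100731120000
```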

I think it's the non-determinism or lack of output that unsettles me.  Can I
enable debugging output or something?
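My best guess at where to turn that on (a sketch only; assuming the stock Nutch 1.x layout, and the exact logger names may differ by version):

```shell
# Sketch, assuming stock Nutch 1.x: logging is configured in
# conf/log4j.properties and output lands in logs/hadoop.log.
# Raising the indexer loggers to DEBUG should show what solrindex
# is doing, e.g.:
#
#   log4j.logger.org.apache.nutch=INFO
#   log4j.logger.org.apache.nutch.indexer=DEBUG
#
# then rerun solrindex and watch the log:
#
#   tail -f logs/hadoop.log
```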

On Sat, Jul 31, 2010 at 8:34 PM, Scott Gonyea <[email protected]> wrote:

> Did you set up the Solr mappings? When you index into Nutch, do the
> documents appear when you query Nutch's interface?
>
> On Jul 31, 2010, at 5:12 PM, Max Lynch <[email protected]> wrote:
>
> > Hi,
> > I'm following the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial)
> > and everything seems to be working fine, except when I try to run
> >
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> >
> > The document count on my Solr server doesn't change (I'm viewing
> > /solr/admin/stats.jsp).  I've even gone so far as to explicitly issue a
> > <commit /> using curl, with no success.
> >
> > It seems like my fetch routine grabs a ton of documents, but only a few
> > make it to Solr, if any (there are about 2000 in there already from a
> > previous nutch solrindex run that added a few).  How can I tell how many
> > documents Nutch is sending to Solr?  Should I just modify the solrindex
> > driver program?
> >
> > Just for reference, my nutch cycle looks like this:
> >
> > $ bin/nutch inject crawlwi/crawldb wiurls/
> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments
> >
> > Then I ran the following a few times, with the newest segment in a variable:
> > $ s1=`ls -d crawlwi/segments/2* | tail -1`
> > $ echo $s1
> > $ bin/nutch fetch $s1 -threads 15
> > $ bin/nutch updatedb crawlwi/crawldb $s1
> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
> >
> > Then
> > $ bin/nutch invertlinks crawlwi/linkdb -dir crawlwi/segments
> > $ bin/nutch index crawlwi/indexes crawlwi/crawldb crawlwi/linkdb crawlwi/segments/*
> > $ bin/nutch solrindex http://127.0.0.1/solr/ crawlwi/crawldb crawlwi/linkdb crawlwi/segments/*
> >
> > But the new documents don't make the index.
> >
> > Any ideas?
> > Thanks.
>
