Hi,
I'm following the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial)
and everything seems to be working fine, except when I run:

$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb \
    crawl/segments/*

The document count on my Solr server doesn't change (I'm checking
/solr/admin/stats.jsp).  I've even gone so far as to explicitly issue a
<commit /> using curl, with no success.
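
For reference, the commit and the count check can be scripted roughly as
below. This is only a sketch: SOLR, solr_commit, num_found and solr_count are
names made up for illustration, and the /update and /select paths assume a
default Solr setup with XML responses.

```shell
#!/bin/sh
# Sketch only: SOLR, solr_commit, num_found and solr_count are illustrative
# names; the /update and /select paths assume a default Solr setup.
SOLR=${SOLR:-http://127.0.0.1:8983/solr}

# Ask Solr to commit pending documents so they become searchable.
solr_commit() {
    curl -s "$SOLR/update" -H 'Content-Type: text/xml' --data-binary '<commit/>'
}

# Pull the numFound attribute out of an XML query response on stdin.
num_found() {
    sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p'
}

# Count all indexed documents without returning any of them (rows=0).
solr_count() {
    curl -s "$SOLR/select?q=*:*&rows=0&wt=xml" | num_found
}
```

Calling solr_count before solrindex and again after solrindex plus
solr_commit should show whether the number of indexed documents moved.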

It seems like my fetch step grabs a ton of documents, but only a few of them
make it to Solr, if any at all (there are about 2000 in there already from a
previous nutch solrindex run that added a few).  How can I tell how many
documents Nutch is sending to Solr?  Should I just modify the solrindex driver
program?

Just for reference, my nutch cycle looks like this:

$ bin/nutch inject crawlwi/crawldb wiurls/
$ bin/nutch generate crawlwi/crawldb crawlwi/segments

Then I ran the following a few times, with the newest segment in a variable:
$ s1=`ls -d crawlwi/segments/2* | tail -1`
$ echo $s1
$ bin/nutch fetch $s1 -threads 15
$ bin/nutch updatedb crawlwi/crawldb $s1
$ bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
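
The repeated round above is roughly equivalent to the loop below. It is a
sketch, not the exact script that was run: NUTCH, CRAWL, ROUNDS and run_rounds
are illustrative names, and the round count of 3 is just an example.

```shell
#!/bin/sh
# Sketch of the repeated fetch/updatedb/generate round.
# NUTCH, CRAWL, ROUNDS and run_rounds are illustrative names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}
CRAWL=${CRAWL:-crawlwi}
ROUNDS=${ROUNDS:-3}

run_rounds() {
    i=1
    while [ "$i" -le "$ROUNDS" ]; do
        # Segment names are timestamps, so the last one listed is the newest.
        s1=$(ls -d "$CRAWL"/segments/2* | tail -1)
        echo "round $i: $s1"
        "$NUTCH" fetch "$s1" -threads 15 || return 1
        "$NUTCH" updatedb "$CRAWL/crawldb" "$s1" || return 1
        "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 5000 || return 1
        i=$((i + 1))
    done
}
```

Calling run_rounds from the Nutch directory runs the rounds and stops at the
first command that fails, which makes it easier to see where a round breaks.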

Then
$ bin/nutch invertlinks crawlwi/linkdb -dir crawlwi/segments
$ bin/nutch index crawlwi/indexes crawlwi/crawldb crawlwi/linkdb \
    crawlwi/segments/*
$ bin/nutch solrindex http://127.0.0.1/solr/ crawlwi/crawldb crawlwi/linkdb \
    crawlwi/segments/*

But the new documents don't make it into the index.

Any ideas?
Thanks.
