So, I figured out the log debugging (I just had to adjust a few settings in log4j.properties), and I've found the source of my solrindex errors. First, many dates in my index fail to parse in MoreIndexingFilter.java, so I added another date format of the form "EEE MMM dd HH:mm:ss zzz yyyy". I'll open a bug tracker entry and attach a patch for it.
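To illustrate, this is the try-each-format-in-order idea the fix relies on. It's just a minimal standalone sketch, not the actual MoreIndexingFilter code: the class name, method name, and the other formats in the list are made up for the example; only the last format string is the one I added.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateFormatSketch {
    // Candidate formats, tried in order. The last entry is the new one;
    // the others are illustrative stand-ins, not Nutch's real list.
    private static final String[] FORMATS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",   // RFC 1123 style
        "yyyy-MM-dd'T'HH:mm:ss",
        "EEE MMM dd HH:mm:ss zzz yyyy"     // e.g. "Sat Jul 31 21:11:00 PDT 2010"
    };

    static Date parseDate(String s) {
        for (String f : FORMATS) {
            try {
                return new SimpleDateFormat(f, Locale.US).parse(s);
            } catch (ParseException ignored) {
                // didn't match this pattern; try the next one
            }
        }
        return null; // no format matched
    }

    public static void main(String[] args) {
        System.out.println(parseDate("Sat Jul 31 21:11:00 PDT 2010"));
    }
}
```

Without the third format, dates like "Sat Jul 31 21:11:00 PDT 2010" fall through every pattern and come back null, which is what was breaking my index.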
However, I've also encountered this issue: "multiple_values_encountered_for_non_multiValued_field_title", which crashes the job. In my solr schema I don't allow multiple values for the "title" field (as per the nutch default). Why would the parser find multiple title values? Seems to be another bug. Any ideas? Thanks.

On Sat, Jul 31, 2010 at 9:11 PM, Max Lynch <[email protected]> wrote:

> The solr schema and mappings all seem to work fine. It's just that
> sometimes I run solrindex and no documents get added to the solr index
> and I have no indication of why that might be. I see my fetcher grabbing
> thousands of pages and yet my doc count on solr doesn't increase.
>
> I've cleared my index and have been following the steps here:
> http://wiki.apache.org/nutch/RunningNutchAndSolr and it seems to be
> working better. I'm just not sure why these steps seem to work better yet
> the nutch tutorial steps before didn't. The only difference I can see is
> the -noParse and parse steps added.
>
> I think it's the non-determinism or lack of output that unsettles me. Can
> I enable debugging output or something?
>
> On Sat, Jul 31, 2010 at 8:34 PM, Scott Gonyea <[email protected]> wrote:
>
>> Did you setup the solr mappings? When you index into nutch, do they
>> appear there when you query nutch's interface?
>>
>> On Jul 31, 2010, at 5:12 PM, Max Lynch <[email protected]> wrote:
>>
>> > Hi,
>> > I'm following the nutch tutorial
>> > (http://wiki.apache.org/nutch/NutchTutorial)
>> > and everything seems to be working fine, except when I try to run
>> >
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
>> > crawl/linkdb crawl/segments/*
>> >
>> > The document count on my solr server doesn't change (I'm viewing
>> > /solr/admin/stats.jsp). I've even gone so far as to explicitly issue a
>> > <commit /> using curl, with no success.
>> >
>> > It seems like my fetch routine grabs a ton of documents, but only a
>> > few make it to solr if at all (there are about 2000 in there already
>> > from a previous nutch solrindex that added a few). How can I tell how
>> > many documents nutch is sending to solr? Should I just modify the
>> > solrindex driver program?
>> >
>> > Just for reference, my nutch cycle looks like this:
>> >
>> > $ bin/nutch inject crawlwi/crawldb wiurls/
>> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments
>> >
>> > Then I ran the following a few times, with the newest segment in a
>> > variable:
>> > $ s1=`ls -d crawlwi/segments/2* | tail -1`
>> > $ echo $s1
>> > $ bin/nutch fetch $s1 -threads 15
>> > $ bin/nutch updatedb crawlwi/crawldb $s1
>> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
>> >
>> > Then
>> > $ bin/nutch invertlinks crawlwi/linkdb -dir crawlwi/segments
>> > $ bin/nutch index crawlwi/indexes crawlwi/crawldb crawlwi/linkdb
>> > crawlwi/segments/*
>> > $ bin/nutch solrindex http://127.0.0.1/solr/ crawlwi/crawldb
>> > crawlwi/linkdb crawlwi/segments/*
>> >
>> > But the new documents don't make it into the index.
>> >
>> > Any ideas?
>> > Thanks.
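For reference on the "title" error above: in the Nutch default conf/schema.xml, "title" is a single-valued field. Roughly like this (treat it as a sketch; the exact type and attributes vary between Nutch/Solr versions, so check your own schema):

```xml
<!-- Sketch of the relevant field definition; multiValued is absent,
     and it defaults to false in Solr. -->
<field name="title" type="text" stored="true" indexed="true"/>
<!-- So if the parse step ever emits two title values for one document,
     Solr rejects it with "multiple values encountered for
     non multiValued field title", which aborts the solrindex job. -->
```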