Hi,

 

It makes no sense indeed. But check your solrindex-mapping.xml in the Nutch
configuration directory; it might copy a field into title. Also check the
schema.xml in your Solr configuration, as it might do the same.

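For example, a hypothetical copyField rule like the following in schema.xml
would feed a second value into title and trigger exactly that kind of
"multiple values for a non-multiValued field" error (the source field name
here is made up for illustration):

```xml
<!-- Hypothetical schema.xml fragment: title is declared without
     multiValued="true", so any copyField targeting it supplies a
     second value and aborts the document add -->
<field name="title" type="text" stored="true" indexed="true"/>
<copyField source="anchor" dest="title"/>
```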
 

To make it a bit more complicated: don't you have some deduplication mechanism
somewhere? If it isn't configured properly, for instance with a field value
that recurs across documents as the source for the signature, it can silently
prevent any additions to the index.

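In Nutch 1.x the signature implementation is selected in the configuration; a
sketch, assuming the stock property name:

```xml
<!-- nutch-site.xml sketch: the class that computes the page signature
     used for deduplication; identical signatures mark pages as duplicates -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```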
 

And which Nutch and Solr versions are you using? I have run multiple setups
with Nutch 1.0, 1.1 and trunk against Solr 1.4 and 1.4.1 and never came across
your error for the title field. Some shipped Nutch configurations did, however,
mess up the url and id fields in the Solr index, which are not multi-valued
either.

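For reference, the shipped schema declares those fields single-valued, roughly
like this (exact types and attributes vary by version, so treat this as a
sketch):

```xml
<!-- none of these declare multiValued="true", so a second value
     for any of them aborts the document add -->
<field name="id" type="string" stored="true" indexed="true"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="title" type="text" stored="true" indexed="true"/>
```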

Cheers,
 
-----Original message-----
From: Max Lynch <[email protected]>
Sent: Mon 02-08-2010 18:32
To: [email protected]; 
Subject: Re: Nutch SolrIndex command not adding documents

So, I figured out the log debugging (I just had to modify a few settings in
log4j.properties), and I've found the source of my solrindex errors.  First
of all, many dates in my index fail to parse properly in
MoreIndexingFilter.java, so I added another date format of the type "EEE MMM
dd HH:mm:ss zzz yyyy"; I will make a bug tracker entry and a patch for it.

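The extra pattern can be exercised on its own with SimpleDateFormat; a minimal
sketch (the sample date string is made up, but has the shape that was failing
to parse):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateFormatCheck {
    public static void main(String[] args) throws ParseException {
        // A date of the shape java.util.Date#toString() produces
        // (sample value is invented for the sketch)
        String raw = "Mon Aug 02 18:32:00 GMT 2010";
        SimpleDateFormat fmt =
                new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
        Date parsed = fmt.parse(raw);
        System.out.println(parsed);
    }
}
```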
However, I've also encountered this issue:
"multiple_values_encountered_for_non_multiValued_field_title"
which crashes the job.  In my solr schema I don't allow multiple values for
the "title" field (as per the nutch default).  Why would the parser find
multiple title values?  Seems to be another bug.

Any ideas?

Thanks.


On Sat, Jul 31, 2010 at 9:11 PM, Max Lynch <[email protected]> wrote:

> The solr schema and mappings all seem to work fine.  It's just that
> sometimes I run solrindex and no documents get added to the solr index and I
> have no indication of why that might be.  I see my fetcher grabbing
> thousands of pages and yet my doc count on solr doesn't increase.
>
> I've cleared my index and have been following the steps here:
> http://wiki.apache.org/nutch/RunningNutchAndSolr and it seems to be
> working better.  I'm just not sure why these steps seem to work better yet
> the nutch tutorial steps before didn't.  The only difference I can see is
> the -noParse and parse steps added.
>
> I think it's the non-determinism or lack of output that unsettles me.  Can
> I enable debugging output or something?
>
>
> On Sat, Jul 31, 2010 at 8:34 PM, Scott Gonyea <[email protected]> wrote:
>
>> Did you setup the solr mappings? When you index into nutch, do they appear
>> there when you query nutch's interface?
>>
>> On Jul 31, 2010, at 5:12 PM, Max Lynch <[email protected]> wrote:
>>
>> > Hi,
>> > I'm following the nutch tutorial (
>> http://wiki.apache.org/nutch/NutchTutorial)
>> > and everything seems to be working fine, except when I try to run
>> >
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
>> crawl/linkdb
>> > crawl/segments/*
>> >
>> > The document count on my solr server doesn't change (I'm viewing
>> > /solr/admin/stats.jsp).  I've even gone so far as to explicitly issue a
>> > <commit /> using curl, with no success.
>> >
>> > It seems like my fetch routine grabs a ton of documents, but only a few
>> make
>> > it to solr if at all (there are about 2000 in there already from a
>> previous
>> > nutch solrindex that added a few).  How can I tell how many documents
>> nutch
>> > is sending to solr?  Should I just modify the solrindex driver program?
>> >
>> > Just for reference, my nutch cycle looks like this:
>> >
>> > $ bin/nutch inject crawlwi/crawldb wiurls/
>> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments
>> >
>> > Then I ran the following a few times, with the newest segment in a
>> variable:
>> > $ s1=`ls -d crawlwi/segments/2* | tail -1`
>> > $ echo $s1
>> > $ bin/nutch fetch $s1 -threads 15
>> > $ bin/nutch updatedb crawlwi/crawldb $s1
>> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
>> >
>> > Then
>> > $ bin/nutch invertlinks crawlwi/linkdb -dir crawlwi/segments
>> > $ bin/nutch index crawlwi/indexes crawlwi/crawldb crawlwi/linkdb
>> > crawlwi/segments/*
>> > $ bin/nutch solrindex http://127.0.0.1/solr/ crawlwi/crawldb
>> crawlwi/linkdb
>> > crawlwi/segments/*
>> >
>> > But the new documents don't make the index.
>> >
>> > Any ideas?
>> > Thanks.
>>
>
>
