I set multiValued="true" in my schema and I no longer see the error. Could it be an interaction with the parse-feed plugin?
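For anyone following along, the change amounts to something like this in the Solr schema.xml. The field type and the stored/indexed attributes below are illustrative only; check them against your own schema:

```xml
<!-- Allow multiple title values so solrindex stops failing with
     multiple_values_encountered_for_non_multiValued_field_title.
     Attributes other than multiValued are illustrative. -->
<field name="title" type="text" stored="true" indexed="true" multiValued="true"/>
```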
Either way, it's working so I'm happy. I'm on Nutch 1.1 and Solr 1.4.1.

On Mon, Aug 2, 2010 at 12:03 PM, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> It makes no sense indeed. But check your solrindex-mapping.xml in the Nutch
> configuration directory, it might copy the field. Also, check your
> schema.xml in the Solr configuration, for it might do the same.
>
> To make it a bit more complicated, don't you have some deduplication
> mechanism somewhere? It can prevent any additions to the index if you
> didn't properly configure it, such as a recurring field value as a source
> for the signature.
>
> And, what Nutch and Solr versions are you using? I have had multiple
> setups with Nutch 1.0, 1.1 and trunk, and Solr 1.4 and 1.4.1, but never
> came across your error for the title field. Some shipped Nutch
> configurations did actually mess up the url and id fields in the Solr
> index, which are not multi-valued.
>
> Cheers,
>
> -----Original message-----
> From: Max Lynch <[email protected]>
> Sent: Mon 02-08-2010 18:32
> To: [email protected]
> Subject: Re: Nutch SolrIndex command not adding documents
>
> So, I figured out the log debugging stuff (just had to modify some
> settings in log4j.properties), and I've found the source of my solrindex
> errors. First of all, many dates in my index fail to parse properly in
> MoreIndexingFilter.java, so I added another date format of the type
> "EEE MMM dd HH:mm:ss zzz yyyy", which I will make a bug tracker entry and
> a patch for.
>
> However, I've also encountered this issue:
> "multiple_values_encountered_for_non_multiValued_field_title"
> which crashes the job. In my Solr schema I don't allow multiple values for
> the "title" field (as per the Nutch default). Why would the parser find
> multiple title values? Seems to be another bug.
>
> Any ideas?
>
> Thanks.
>
> On Sat, Jul 31, 2010 at 9:11 PM, Max Lynch <[email protected]> wrote:
>
> > The Solr schema and mappings all seem to work fine.
> > It's just that sometimes I run solrindex and no documents get added to
> > the Solr index, and I have no indication of why that might be. I see my
> > fetcher grabbing thousands of pages, and yet my doc count on Solr
> > doesn't increase.
> >
> > I've cleared my index and have been following the steps here:
> > http://wiki.apache.org/nutch/RunningNutchAndSolr and it seems to be
> > working better. I'm just not sure why these steps seem to work better,
> > yet the Nutch tutorial steps before didn't. The only difference I can
> > see is the -noParse and parse steps added.
> >
> > I think it's the non-determinism, or lack of output, that unsettles me.
> > Can I enable debugging output or something?
> >
> > On Sat, Jul 31, 2010 at 8:34 PM, Scott Gonyea <[email protected]> wrote:
> >
> >> Did you set up the Solr mappings? When you index into Nutch, do they
> >> appear there when you query Nutch's interface?
> >>
> >> On Jul 31, 2010, at 5:12 PM, Max Lynch <[email protected]> wrote:
> >>
> >> > Hi,
> >> > I'm following the Nutch tutorial
> >> > (http://wiki.apache.org/nutch/NutchTutorial) and everything seems to
> >> > be working fine, except when I try to run
> >> >
> >> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> >> >
> >> > The document count on my Solr server doesn't change (I'm viewing
> >> > /solr/admin/stats.jsp). I've even gone so far as to explicitly issue
> >> > a <commit /> using curl, with no success.
> >> >
> >> > It seems like my fetch routine grabs a ton of documents, but only a
> >> > few make it to Solr, if at all (there are about 2000 in there already
> >> > from a previous nutch solrindex run that added a few). How can I tell
> >> > how many documents Nutch is sending to Solr? Should I just modify the
> >> > solrindex driver program?
> >> >
> >> > Just for reference, my Nutch cycle looks like this:
> >> >
> >> > $ bin/nutch inject crawlwi/crawldb wiurls/
> >> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments
> >> >
> >> > Then I ran the following a few times, with the newest segment in a
> >> > variable:
> >> > $ s1=`ls -d crawlwi/segments/2* | tail -1`
> >> > $ echo $s1
> >> > $ bin/nutch fetch $s1 -threads 15
> >> > $ bin/nutch updatedb crawlwi/crawldb $s1
> >> > $ bin/nutch generate crawlwi/crawldb crawlwi/segments -topN 5000
> >> >
> >> > Then:
> >> > $ bin/nutch invertlinks crawlwi/linkdb -dir crawlwi/segments
> >> > $ bin/nutch index crawlwi/indexes crawlwi/crawldb crawlwi/linkdb crawlwi/segments/*
> >> > $ bin/nutch solrindex http://127.0.0.1/solr/ crawlwi/crawldb crawlwi/linkdb crawlwi/segments/*
> >> >
> >> > But the new documents don't make it into the index.
> >> >
> >> > Any ideas?
> >> > Thanks.
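As an aside, for anyone hitting the same date-parse failures discussed above: the extra "EEE MMM dd HH:mm:ss zzz yyyy" format can be exercised with a small standalone sketch like the one below. This is not the actual MoreIndexingFilter patch; the class name and the fallback list of formats are mine, purely for illustration of the try-each-format approach.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateFormatSketch {
    // Candidate formats tried in order; the last one is the format
    // added for dates like "Mon Aug 02 12:03:00 PDT 2010".
    private static final String[] FORMATS = {
        "yyyy-MM-dd'T'HH:mm:ss",
        "EEE, dd MMM yyyy HH:mm:ss zzz",   // RFC 1123-style dates
        "EEE MMM dd HH:mm:ss zzz yyyy"     // the newly added pattern
    };

    /** Try each candidate format; return null if none parses the value. */
    public static Date parseDate(String value) {
        for (String f : FORMATS) {
            try {
                return new SimpleDateFormat(f, Locale.ENGLISH).parse(value);
            } catch (ParseException ignored) {
                // fall through to the next candidate format
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A date of the shape that was failing before the extra format
        System.out.println(parseDate("Mon Aug 02 12:03:00 PDT 2010"));
    }
}
```

Note that SimpleDateFormat is not thread-safe, so in a real indexing filter each thread should get its own instance rather than sharing the formats statically.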

