From memory, we always use the id field as the unique key, with no exceptions. As for the use of ConcurrentUpdateSolrServer, this is not correct (my bad); we should just use HttpSolrServer with the defaults. I will update the patch, and I will also cook one up for trunk.
Thanks for your feedback Amit. It is really helpful.
Lewis
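For illustration, here is a minimal sketch of what such a helper in SolrUtils might look like, just returning an HttpSolrServer with default settings (the method name getHttpSolrServer is an assumption here, not necessarily what the patch will use):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class SolrUtils {
      // Hypothetical helper: return a plain HttpSolrServer with its
      // default settings instead of a ConcurrentUpdateSolrServer.
      public static SolrServer getHttpSolrServer(String solrUrl) {
        return new HttpSolrServer(solrUrl);
      }
    }
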
On Wed, Apr 10, 2013 at 11:27 AM, Amit Sela <[email protected]> wrote:
> Yep. That seemed to be the problem.
> If the id field is to be set by schema.xml then it shouldn't be constant.
> Or decide that Nutch always uses id as the unique key.
> On Apr 10, 2013 6:01 PM, "Amit Sela" <[email protected]> wrote:
>
> > I saw the patch for Nutch 2.x where you replaced CommonsHttpSolrServer
> > with ConcurrentUpdateSolrServer, but in 1.6
> > SolrUtils.getCommonsHttpSolrServer is used for getting the SolrServer.
> > Should we add a getConcurrentUpdateSolrServer to SolrUtils?
> > As I understand it, the exception I got was caused by an empty result set
> > returned by the Solr query... could it be because of using url as the uniqueKey?
> > I see in SolrDeleteDuplicates.java:
> > line 226: solrQuery.setFields(SolrConstants.ID_FIELD,
> >     SolrConstants.BOOST_FIELD,
> >     SolrConstants.TIMESTAMP_FIELD,
> >     SolrConstants.DIGEST_FIELD);
> >
> >
> > On Tue, Apr 9, 2013 at 9:15 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Before we do the upgrade we need to consolidate all of these use cases.
> >> What criteria do we want to review and accept as the unique key? Will this
> >> change between Nutch trunk and 2.x?
> >>
> >> On Tuesday, April 9, 2013, Amit Sela <[email protected]> wrote:
> >> > Well, according to our other correspondence, the only thing I did
> >> > differently in my schema.xml (schema-solr4.xml) before rebuilding Nutch
> >> > was <uniqueKey>url</uniqueKey> instead of <uniqueKey>id</uniqueKey>.
> >> >
> >> > It all goes well until the dedup phase, where the MapReduce job throws:
> >> >
> >> > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> >> >   at java.util.ArrayList.rangeCheck(ArrayList.java:604)
> >> >   at java.util.ArrayList.get(ArrayList.java:382)
> >> >   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:268)
> >> >   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
> >> >   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
> >> >   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
> >> >   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >> >   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> >> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >> >   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >> >   at java.security.AccessController.doPrivileged(Native Method)
> >> >   at javax.security.auth.Subject.doAs(Subject.java:415)
> >> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >> >   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Mon, Apr 8, 2013 at 10:33 PM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> >> It would probably be best to describe what you've tried here, possibly a
> >> >> paste of your schema, what you've done (if anything) to the Nutch source
> >> >> to get it working with Solr 4, etc.
> >> >> The stack trace you get would also be beneficial.
> >> >> Thank you
> >> >> Lewis
> >> >>
> >> >>
> >> >> On Mon, Apr 8, 2013 at 4:13 AM, Amit Sela <[email protected]> wrote:
> >> >>
> >> >> > Is it possible? I saw a Jira issue open about connecting to SolrCloud
> >> >> > via ZooKeeper, but with a direct connection to one of the servers, is
> >> >> > it possible to index with Nutch 1.6 into a Solr 4.2 setup running as a
> >> >> > cloud with a ZooKeeper ensemble? I keep getting IndexOutOfBounds
> >> >> > exceptions in the dedup M/R phase.
> >> >> >
> >> >> > Thanks.
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> *Lewis*
> >> >>
> >> >
> >>
> >> --
> >> *Lewis*
> >>
>
>

--
*Lewis*
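On the IndexOutOfBoundsException quoted above: it looks like get(0) is being called on an empty SolrDocumentList when the query matches no documents for the requested fields. A minimal sketch of the kind of defensive guard that would avoid it (illustrative only; the class and method names here are assumptions, not the actual SolrDeleteDuplicates code):

    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class DedupGuardSketch {
      // Hypothetical guard: return the first document from a Solr query
      // result, or null when the result set is empty, instead of calling
      // get(0) unconditionally (which throws IndexOutOfBoundsException
      // on an empty list).
      public static SolrDocument firstOrNull(SolrDocumentList docs) {
        if (docs == null || docs.isEmpty()) {
          return null;
        }
        return docs.get(0);
      }
    }
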

