Yep. That seemed to be the problem. If the id field is to be set by schema.xml then it shouldn't be constant. Or decide that Nutch always uses id as the unique key.

On Apr 10, 2013 6:01 PM, "Amit Sela" <[email protected]> wrote:
> I saw the patch for Nutch 2.x where you replaced CommonsHttpSolrServer
> with ConcurrentUpdateSolrServer, but in 1.6
> SolrUtils.getCommonsHttpSolrServer is used for getting the SolrServer.
> Should we add a getConcurrentUpdateSolrServer to SolrUtils?
> As I understand it, the exception I got was caused by an empty result set
> returned by the SolrQuery... could it be because of using url as the
> uniqueKey? I see in SolrDeleteDuplicates.java:
>
> line 226: solrQuery.setFields(SolrConstants.ID_FIELD,
>               SolrConstants.BOOST_FIELD,
>               SolrConstants.TIMESTAMP_FIELD,
>               SolrConstants.DIGEST_FIELD);
>
> On Tue, Apr 9, 2013 at 9:15 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Before we do the upgrade we need to consolidate all of these use cases.
>> What criteria do we want to review and accept as the unique key? Will this
>> change between Nutch trunk and 2.x?
>>
>> On Tuesday, April 9, 2013, Amit Sela <[email protected]> wrote:
>>
>> > Well, according to our other correspondence, the only thing I did
>> > differently in my schema.xml (schema-solr4.xml) before rebuilding Nutch
>> > was <uniqueKey>url</uniqueKey> instead of <uniqueKey>id</uniqueKey>.
>> >
>> > It all goes well until the dedup phase, where the MapReduce job throws:
>> >
>> > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>> >     at java.util.ArrayList.rangeCheck(ArrayList.java:604)
>> >     at java.util.ArrayList.get(ArrayList.java:382)
>> >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:268)
>> >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>> >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>> >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>> >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>> >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >     at java.security.AccessController.doPrivileged(Native Method)
>> >     at javax.security.auth.Subject.doAs(Subject.java:415)
>> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>> >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >
>> > Thanks.
>> >
>> > On Mon, Apr 8, 2013 at 10:33 PM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> >> It would probably be best to describe what you've tried here, possibly
>> >> a paste of your schema, what you've done (if anything) to the Nutch
>> >> source to get it working with Solr 4, etc.
>> >> The stack trace you get would also be beneficial.
>> >> Thank you
>> >> Lewis
>> >>
>> >> On Mon, Apr 8, 2013 at 4:13 AM, Amit Sela <[email protected]> wrote:
>> >>
>> >> > Is it possible? I saw an open Jira issue about connecting to
>> >> > SolrCloud via ZooKeeper, but with a direct connection to one of the
>> >> > servers, is it possible to index with Nutch 1.6 into Solr 4.2 set up
>> >> > as a cloud with a ZooKeeper ensemble? I ask because I keep getting
>> >> > IndexOutOfBoundsExceptions in the dedup M/R phase.
>> >> >
>> >> > Thanks.
>> >>
>> >> --
>> >> *Lewis*
>>
>> --
>> *Lewis*
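A minimal sketch, assuming the diagnosis in the thread is right, of how the dedup read path fails: SolrDeleteDuplicates requests the constant id field, and when the index was built with <uniqueKey>url</uniqueKey> the query can come back with an empty result list, so an unguarded get(0) produces exactly the "Index: 0, Size: 0" trace quoted above. The class name, the guard, and the literal field names ("id", "boost", "tstamp", "digest", taken to be the SolrConstants values) are illustrative; only the SolrJ 4.x calls are the library's own.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class DedupReadSketch {

  // Mirrors the read path in SolrDeleteDuplicates: fetch the dedup fields
  // and pull documents out of the returned list by index.
  static void readFirst(SolrServer server) throws SolrServerException {
    SolrQuery solrQuery = new SolrQuery("*:*");
    // Assumed to match the SolrConstants values used at line 226.
    solrQuery.setFields("id", "boost", "tstamp", "digest");

    QueryResponse response = server.query(solrQuery);
    SolrDocumentList docs = response.getResults();

    // With <uniqueKey>url</uniqueKey> the result set can be empty, and an
    // unguarded docs.get(0) throws IndexOutOfBoundsException: Index: 0, Size: 0.
    if (docs.isEmpty()) {
      System.err.println("Empty result set -- check which field is the uniqueKey");
      return;
    }
    System.out.println("digest = " + docs.get(0).getFieldValue("digest"));
  }
}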

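The top reply argues the key "shouldn't be constant" if schema.xml is allowed to define it. A hedged sketch of that idea, with solr.dedup.uniquekey as an invented property name (Nutch 1.6 itself hard-codes SolrConstants.ID_FIELD):

import org.apache.hadoop.conf.Configuration;

public class UniqueKeySketch {

  // Resolve the unique-key field from job configuration instead of a
  // constant, so the dedup query follows whatever schema.xml declares.
  // "solr.dedup.uniquekey" is hypothetical; the "id" default matches the
  // schema Nutch ships with.
  static String uniqueKeyField(Configuration conf) {
    return conf.get("solr.dedup.uniquekey", "id");
  }
}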

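On the question of adding getConcurrentUpdateSolrServer to SolrUtils: a sketch of how such a helper could sit beside getCommonsHttpSolrServer in 1.6, assuming SolrJ 4.x is on the classpath. The queue size and thread count are illustrative placeholders, not tuned recommendations.

import org.apache.hadoop.mapred.JobConf;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

public class SolrUtilsSketch {

  // Build a ConcurrentUpdateSolrServer from the job's Solr endpoint.
  // "solr.server.url" is the property Nutch 1.x reads for the Solr URL;
  // 1000 queued documents drained by 10 threads is an arbitrary starting point.
  public static ConcurrentUpdateSolrServer getConcurrentUpdateSolrServer(JobConf job) {
    String url = job.get("solr.server.url");
    return new ConcurrentUpdateSolrServer(url, 1000, 10);
  }
}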