The dedup stage fails with the following error:

SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/collection5
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:390)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:395)
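For context, the dedup stage of the 1.x crawl script comes down to a single SolrDeleteDuplicates invocation, roughly like this (a sketch only; the Solr URL is the one from the error above, and the exact command name can differ between Nutch versions):

    # run the Solr duplicate-deletion job against the same index used for indexing
    $bin/nutch solrdedup http://localhost:8983/solr/collection5

The "Job failed!" message itself is just Hadoop's generic JobClient error, so the real cause is usually in logs/hadoop.log or the task logs rather than in this stack trace.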
On Sat, Jun 22, 2013 at 8:03 AM, Tejas Patil <[email protected]> wrote:
> Thanks Joe for pointing it out. There was a jira [0] for this bug and the
> change is already present in the trunk.
>
> [0] : https://issues.apache.org/jira/browse/NUTCH-1500
>
>
> On Fri, Jun 21, 2013 at 7:11 PM, Joe Zhang <[email protected]> wrote:
>
> > The new crawl script is quite useful. Thanks for the addition.
> >
> > It comes with a bug, though:
> >
> > Line 169:
> > $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
> > $CRAWL_PATH/linkdb $SEGMENT
> >
> > should be:
> >
> > $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
> > $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
> >
> > instead.
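For anyone patching their copy of bin/crawl before the NUTCH-1500 fix ships in a release, the relevant part of the script looks roughly like the sketch below. The way $SEGMENT is derived here is an assumption based on the trunk script and may differ in your version:

    # pick the newest segment produced by the fetch/parse cycle (assumed derivation)
    SEGMENT=`ls "$CRAWL_PATH"/segments/ | sort -n | tail -n 1`

    # index into Solr; note the full segment path, per Joe's fix
    $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb \
        $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

With only $SEGMENT (the bare directory name) on that line, solrindex is handed a path that does not exist, which is a common way to end up with a failed indexing or dedup job like the one above.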

