Hi André,

have a look at the changes made to address NUTCH-1708 [1] [2]
and try
      <field dest="id" source="id"/>
instead of
      <field dest="id" source="url"/>

Thanks,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1708
[2] https://github.com/apache/nutch/commit/bad0a2076a8c724a0542b923ac10bb812c0de644?diff=unified
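
P.S.: after changing the mapping you will also need to re-run the indexing
step so that the documents are resubmitted under the new id, roughly like
this (the crawl paths and the Solr core URL are only placeholders for your
setup; solr.server.url can also be set in nutch-site.xml instead of on the
command line):

      bin/nutch index -D solr.server.url=http://localhost:8983/solr/mycore \
          crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments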

On 01/30/2017 12:26 PM, André Schild wrote:
> Hello,
> 
> We have a working installation of Nutch 1.6 and Solr 4.0.0.
> Now we have tried to upgrade to Nutch 1.11 and Solr 6.4.0.
> 
> So far, crawling works with 1.11 as intended, but adding the documents to Solr 
> fails because of the uniqueKey constraint on the id field.
> 
> We see this error when Nutch tries to submit to Solr:
> 
> 
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
>         at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>         at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
>         at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-01-30 12:16:41,274 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> 
> The URL http://www.mysite.ch redirects with a 301 status to 
> http://www.mysite.ch/de/start.html
> 
> My solrindex-mapping.xml looks like this:
> 
> <mapping>
>   <fields>
>     <field dest="fullContent" source="content"/>
>     <field dest="content" source="strippedContent"/>
>     <field dest="title" source="title"/>
>     <field dest="host" source="host"/>
>     <field dest="segment" source="segment"/>
>     <field dest="boost" source="boost"/>
>     <field dest="digest" source="digest"/>
>     <field dest="tstamp" source="tstamp"/>
>     <field dest="id" source="url"/>
>     <field dest="lang" source="lang"/>
>     <field dest="metatag-description" source="metatag.description"/>
>     <field dest="metatag-keywords" source="metatag.keywords"/>
>     <copyField source="url" dest="url"/>
>   </fields>
>   <uniqueKey>id</uniqueKey>
> </mapping>
> 
> And the relevant part of the Solr schema:
> 
>   <uniqueKey>id</uniqueKey>
> 
> I see why this causes problems.
> How can I tell Nutch to submit only one URL (ideally the original URL) to 
> Solr, and not both?
> 
> 
> André Schild
> 
> Aarboard AG <http://www.aarboard.ch/>
> Egliweg 10
> 2560 Nidau
> Switzerland
> +41 32 332 97 14
> 
> 
