Hi André,
have a look at the changes made to address NUTCH-1708 [1] [2]
and try
<field dest="id" source="id"/>
instead of
<field dest="id" source="url"/>
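
For illustration, here is a sketch of your mapping with only that one line changed. I am assuming the other fields can stay exactly as in your mail. My understanding (please verify against [1] and [2]) is that since NUTCH-1708 the indexer already fills an "id" field on its own, so mapping "url" onto "id" as well would leave the document with two id values, which would match the error you quoted.

```xml
<mapping>
  <fields>
    <field dest="fullContent" source="content"/>
    <field dest="content" source="strippedContent"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
    <!-- changed: take the Solr id from the Nutch "id" field instead of "url" -->
    <field dest="id" source="id"/>
    <field dest="lang" source="lang"/>
    <field dest="metatag-description" source="metatag.description"/>
    <field dest="metatag-keywords" source="metatag.keywords"/>
    <!-- unchanged: "url" is still passed through to the Solr "url" field -->
    <copyField source="url" dest="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

After changing the file, re-run the indexing step and check whether the "multiple values for uniqueKey field" error still shows up.
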
Thanks,
Sebastian
[1] https://issues.apache.org/jira/browse/NUTCH-1708
[2] https://github.com/apache/nutch/commit/bad0a2076a8c724a0542b923ac10bb812c0de644?diff=unified
On 01/30/2017 12:26 PM, André Schild wrote:
> Hello,
>
> we have a working installation of nutch 1.6 and solr 4.0.0.
> We have now tried to upgrade to nutch 1.11 and solr 6.4.0.
>
> So far crawling works with 1.11 as intended, but adding the documents to solr
> fails because of the uniqueness constraint on the id field.
>
> We see this error when nutch tries to submit to solr:
>
>
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
>     at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 2017-01-30 12:16:41,274 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>
> The url http://www.mysite.ch redirects with a 301 status to
> http://www.mysite.ch/de/start.html
>
> My solrindex-mapping.xml looks like this:
>
> <mapping>
>   <fields>
>     <field dest="fullContent" source="content" />
>     <field dest="content" source="strippedContent" />
>     <field dest="title" source="title"/>
>     <field dest="host" source="host"/>
>     <field dest="segment" source="segment"/>
>     <field dest="boost" source="boost"/>
>     <field dest="digest" source="digest"/>
>     <field dest="tstamp" source="tstamp"/>
>     <field dest="id" source="url"/>
>     <field dest="lang" source="lang"/>
>     <field dest="metatag-description" source="metatag.description" />
>     <field dest="metatag-keywords" source="metatag.keywords" />
>     <copyField source="url" dest="url"/>
>   </fields>
>   <uniqueKey>id</uniqueKey>
> </mapping>
>
> And the (relevant parts of the) solr schema:
>
> <uniqueKey>id</uniqueKey>
>
> I see why this causes problems.
> How can I tell nutch to submit only one URL (ideally the original URL) to
> solr, and not both?
>
>
> André Schild
>
> Aarboard AG <http://www.aarboard.ch/>
> Egliweg 10
> 2560 Nidau
> Switzerland
> +41 32 332 97 14
>
>