Hello,

We have a working installation of Nutch 1.6 and Solr 4.0.0.
We are now trying to upgrade to Nutch 1.11 and Solr 6.4.0.

So far, crawling with 1.11 works as intended, but adding the documents to Solr
fails because of the unique constraint on the id field.

We see this error when Nutch tries to submit to Solr:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-01-30 12:16:41,274 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

The URL http://www.mysite.ch redirects with a 301 status to
http://www.mysite.ch/de/start.html
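
For reference, the redirect can be confirmed from the command line (www.mysite.ch is the anonymized hostname used throughout this message, so substitute the real site):

curl -sI http://www.mysite.ch/ | head -n 5
# The first line should read "HTTP/1.1 301 Moved Permanently" and a
# Location: header should point at /de/start.html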

My solrindex-mapping.xml looks like this:

<mapping>
  <fields>
    <field dest="fullContent" source="content" />
    <field dest="content" source="strippedContent" />
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="id" source="url"/>
    <field dest="lang" source="lang"/>
    <field dest="metatag-description" source="metatag.description" />
    <field dest="metatag-keywords" source="metatag.keywords" />
    <copyField source="url" dest="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>

And the (relevant parts of the) solr schema:

  <uniqueKey>id</uniqueKey>

I see why this causes problems.
How can I tell Nutch to submit only one URL (ideally the original URL) to
Solr, and not both?


André Schild

Aarboard AG <http://www.aarboard.ch/>
Egliweg 10
2560 Nidau
Switzerland
+41 32 332 97 14
