You must have the boost and digest fields in the Solr index to use dedup.
This is the description from the Javadoc for the class:
MapReduce:
* <ul>
* <li>Map: Identity map where keys are digests and values are
* {@link SolrRecord} instances (which contain id, boost and timestamp)</li>
* <li>Reduce: After map, {@link SolrRecord}s with the same digest will be
* grouped together. Now, of these documents with the same digests, delete
* all of them except the one with the highest score (boost field). If two
* (or more) documents have the same score, then the document with the
* latest timestamp is kept. Again, every other is deleted from solr index.
* </li>
* </ul>
So the boost field does play a role here: it is the score that decides which duplicate is kept.
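
The reduce rule from the Javadoc can be sketched roughly as follows. This is only an illustration, not the Nutch source: the Record class is a hypothetical stand-in for SolrRecord, and pickSurvivor is a name made up for this example.

```java
import java.util.Arrays;
import java.util.List;

public class DedupRuleSketch {

    // Hypothetical stand-in for Nutch's SolrRecord (id, boost, timestamp only).
    static final class Record {
        final String id;
        final float boost;
        final long tstamp;

        Record(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    // Among documents sharing one digest, return the one to keep:
    // the highest boost wins, ties are broken by the latest timestamp.
    static Record pickSurvivor(List<Record> sameDigest) {
        Record keep = null;
        for (Record r : sameDigest) {
            if (keep == null
                    || r.boost > keep.boost
                    || (r.boost == keep.boost && r.tstamp > keep.tstamp)) {
                keep = r;
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<Record> group = Arrays.asList(
                new Record("a", 1.0f, 100L),
                new Record("b", 2.0f, 50L),
                new Record("c", 2.0f, 75L));
        // Every record in the group other than the survivor would be deleted.
        System.out.println(pickSurvivor(group).id); // prints "c"
    }
}
```

Without a boost value there is nothing to compare on, which is why the job cannot work once the field is gone.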
As for why you got the NullPointerException: if a Solr document is missing
any of these fields (id, boost, digest), the value read for that field is
null, and null cannot be assigned to a primitive member of the record:
private float boost;
private long tstamp;
private String id;
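
A minimal sketch of the failure, assuming the record is filled by casting the field value and assigning it to the primitive member (your stack trace points at readSolrDocument, but the code below is not the Nutch source). The Map stands in for a Solr document, and readBoostOrDefault is a hypothetical guard, not something Nutch provides.

```java
import java.util.HashMap;
import java.util.Map;

public class MissingFieldSketch {

    // The map is a stand-in for a Solr document: a missing field yields null.
    static float readBoostUnchecked(Map<String, Object> doc) {
        // Casting null to Float is fine; unboxing it to float throws
        // NullPointerException.
        return (Float) doc.get("boost");
    }

    // Hypothetical guard: fall back to a default when the field is absent.
    static float readBoostOrDefault(Map<String, Object> doc, float fallback) {
        Object value = doc.get("boost");
        return (value instanceof Float) ? (Float) value : fallback;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "http://example.org/"); // no boost field stored

        System.out.println(readBoostOrDefault(doc, 1.0f)); // prints "1.0"
        try {
            readBoostUnchecked(doc);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException on missing boost");
        }
    }
}
```

In practice the simpler fix is to put the boost field back into the schema and reindex, so that every document carries it.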
Best Regards
Alexander Aristov
On 4 November 2010 21:49, Markus Jelsma <[email protected]> wrote:
> Found the problem. The boost field was removed, but it seems the dedup job
> needs it. I haven't tested it, but since I recently removed the field it
> makes sense.
>
> Why would I need the boost field anyway, and why does the dedup job need
> it?
>
> > Hi all,
> >
> > For some reason I get an exception on this job. I don't know since when,
> > but something is going wrong somewhere. It's against Solr 1.4.1.
> >
> > Here's Solr's log output, just before the exception:
> >
> > Nov 4, 2010 2:18:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [core_name] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=1} hits=12197 status=0 QTime=1
> > Nov 4, 2010 2:18:44 PM org.apache.solr.core.SolrCore execute
> > INFO: [core_name] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=1} hits=12197 status=0 QTime=1
> > Nov 4, 2010 2:18:45 PM org.apache.solr.core.SolrCore execute
> > INFO: [core_name] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=12197&version=1} hits=12197 status=0 QTime=98
> > Nov 4, 2010 2:19:00 PM org.apache.solr.core.SolrCore execute
> > INFO: [core_name] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
> > Nov 4, 2010 2:19:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
> > INFO: Slave in sync with master.
> >
> > Then Nutch's hadoop.log output:
> >
> > 2010-11-04 14:18:35,952 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2010-11-04 14:18:35
> > 2010-11-04 14:18:35,952 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/fcgroningen_master
> > 2010-11-04 14:18:59,652 WARN mapred.LocalJobRunner - job_local_0001
> > java.lang.NullPointerException
> >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrRecord.readSolrDocument(SolrDeleteDuplicates.java:129)
> >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
> >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:240)
> >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >
> > Anyone got an idea?
> >
> > Cheers,
>