Thanks for the pointer. I haven't looked at the sources on this one (although I
should).
> You must have the boost and digest fields in the Solr index to use dedup.
>
> This is the description from the Javadoc for the class.
>
> MapReduce:
> * <ul>
> * <li>Map: Identity map where keys are digests and values are
> * {@link SolrRecord} instances (which contain id, boost and timestamp)</li>
> * <li>Reduce: After map, {@link SolrRecord}s with the same digest will be
> * grouped together. Now, of these documents with the same digests, delete
> * all of them except the one with the highest score (boost field). If two
> * (or more) documents have the same score, then the document with the
> * latest timestamp is kept. Again, every other is deleted from solr index.
> * </li>
> * </ul>
>
> So boost does play a role here.
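
A minimal sketch of the reduce rule quoted above, for one group of records that share a digest: keep the highest boost, break ties by the latest timestamp, and delete the rest. The `Rec` class and `idsToDelete` method are hypothetical stand-ins written for illustration, not Nutch's actual `SolrRecord` or reducer code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupSketch {
    // Hypothetical stand-in for SolrRecord: only the fields the javadoc names.
    static final class Rec {
        final String id;
        final float boost;
        final long tstamp;

        Rec(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    // For records sharing one digest, return the ids that would be deleted.
    static List<String> idsToDelete(List<Rec> sameDigest) {
        Rec keep = null;
        for (Rec r : sameDigest) {
            boolean higherBoost = keep != null && r.boost > keep.boost;
            boolean tieButNewer = keep != null && r.boost == keep.boost
                    && r.tstamp > keep.tstamp;
            if (keep == null || higherBoost || tieButNewer) {
                keep = r; // best candidate seen so far
            }
        }
        List<String> deletes = new ArrayList<>();
        for (Rec r : sameDigest) {
            if (r != keep) {
                deletes.add(r.id);
            }
        }
        return deletes;
    }

    public static void main(String[] args) {
        List<Rec> group = Arrays.asList(
                new Rec("a", 1.0f, 100L),
                new Rec("b", 2.0f, 50L),
                new Rec("c", 2.0f, 40L));
        System.out.println(idsToDelete(group));
    }
}
```

Without a boost value there is nothing to compare in the first step, which is consistent with the job requiring the field.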
>
> As for why you got the NullPointerException: if a Solr document lacks any
> of these fields (id, boost, digest), the missing value is null, and null
> cannot be assigned to the primitive-typed fields below.
>
> private float boost;
> private long tstamp;
> private String id;
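
A small sketch of how a missing field can turn into a NullPointerException when the target is a primitive. It assumes the value is read as a boxed `Float`/`Long` and then assigned to the primitive field; I have not checked that `readSolrDocument` is written exactly this way. The `Map` stands in for a Solr document, and `UnboxingSketch`, `read` and `boostOrDefault` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class UnboxingSketch {
    private float boost;
    private long tstamp;
    private String id;

    // The map stands in for a Solr document: get() returns null for a missing field.
    void read(Map<String, Object> doc) {
        id = (String) doc.get("id");       // a null String is fine
        boost = (Float) doc.get("boost");  // null Float unboxed to float: NPE
        tstamp = (Long) doc.get("tstamp"); // same risk for long
    }

    // Hypothetical null-safe variant: fall back to a default when the field is absent.
    static float boostOrDefault(Map<String, Object> doc, float fallback) {
        Object value = doc.get("boost");
        return value == null ? fallback : (Float) value;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "http://example.org/");
        doc.put("tstamp", 1288880322000L);
        // "boost" is deliberately left out, as after removing it from the schema.
        try {
            new UnboxingSketch().read(doc);
        } catch (NullPointerException e) {
            System.out.println("NPE while reading boost");
        }
        System.out.println(boostOrDefault(doc, 1.0f));
    }
}
```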
>
>
> Best Regards
> Alexander Aristov
>
> On 4 November 2010 21:49, Markus Jelsma <[email protected]> wrote:
> > Found the problem. The boost field was removed, but it seems the dedup job
> > needs it. I haven't tested it, but since I recently removed the field it
> > makes sense.
> >
> > Why would I need the boost field anyway, and why does the dedup job need
> > it?
> >
> > > Hi all,
> > >
> > > For some reason I get an exception on this job. I don't know since when,
> > > but something is going wrong somewhere. It's against Solr 1.4.1.
> > >
> > > Here's Solr's log output, just before the exception:
> > >
> > > Nov 4, 2010 2:18:42 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=1} hits=12197 status=0 QTime=1
> > > Nov 4, 2010 2:18:44 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=1} hits=12197 status=0 QTime=1
> > > Nov 4, 2010 2:18:45 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=12197&version=1} hits=12197 status=0 QTime=98
> > > Nov 4, 2010 2:19:00 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
> > > Nov 4, 2010 2:19:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
> > > INFO: Slave in sync with master.
> > >
> > > Then Nutch's hadoop.log output:
> > >
> > > 2010-11-04 14:18:35,952 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2010-11-04 14:18:35
> > > 2010-11-04 14:18:35,952 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/fcgroningen_master
> > > 2010-11-04 14:18:59,652 WARN mapred.LocalJobRunner - job_local_0001
> > > java.lang.NullPointerException
> > >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrRecord.readSolrDocument(SolrDeleteDuplicates.java:129)
> > >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
> > >     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:240)
> > >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> > >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> > >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > >
> > > Anyone got an idea?
> > >
> > > Cheers,