Thanks for the pointer, I haven't looked at the sources on this one (although I should).

> You must have the boost and digest fields in the Solr index to use dedup.
> 
> This is the description from the javadoc for the class:
> 
> MapReduce:
>  * <ul>
>  * <li>Map: Identity map where keys are digests and values are
>  * {@link SolrRecord} instances (which contain id, boost and timestamp)</li>
>  * <li>Reduce: After map, {@link SolrRecord}s with the same digest will be
>  * grouped together. Now, of these documents with the same digests, delete
>  * all of them except the one with the highest score (boost field). If two
>  * (or more) documents have the same score, then the document with the
>  * latest timestamp is kept. Again, every other is deleted from solr index.
>  * </li>
>  * </ul>
> 
> So boost plays a certain role here.
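> 
> To make the tie-breaking concrete, here is a rough sketch of the reduce
> step the javadoc describes. This is not the actual Nutch source; the
> SolrRecord accessor names (getBoost, getTstamp, getId) are assumed for
> illustration:
> 
>     import java.util.ArrayList;
>     import java.util.Iterator;
>     import java.util.List;
> 
>     public class DedupSketch {
>         // Assumed accessors; names are illustrative, not quoted from Nutch.
>         interface SolrRecord {
>             float getBoost();
>             long getTstamp();
>             String getId();
>         }
> 
>         // Given all records sharing one digest, return the ids to delete:
>         // everything except the record with the highest boost, with ties
>         // broken by the latest timestamp.
>         static List<String> idsToDelete(Iterator<SolrRecord> values) {
>             SolrRecord best = null;
>             List<String> toDelete = new ArrayList<String>();
>             while (values.hasNext()) {
>                 SolrRecord rec = values.next();
>                 if (best == null) { best = rec; continue; }
>                 boolean recWins = rec.getBoost() > best.getBoost()
>                     || (rec.getBoost() == best.getBoost()
>                         && rec.getTstamp() > best.getTstamp());
>                 toDelete.add((recWins ? best : rec).getId());
>                 if (recWins) best = rec;
>             }
>             return toDelete;
>         }
>     }
> 
> The collected ids would then presumably be removed with something like
> SolrJ's deleteById(List<String>).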
> 
> As to why you got the NullPointerException: if any of these fields (id,
> boost, digest) is missing from a Solr document, then the primitive
> members below cannot be assigned, because unboxing a null value throws.
> 
>     private float boost;
>     private long tstamp;
>     private String id;
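> 
> For example, this minimal sketch (using SolrJ's SolrDocument; the field
> name is taken from the query in your log) reproduces the failure mode:
> 
>     import org.apache.solr.common.SolrDocument;
> 
>     public class UnboxingNpe {
>         public static void main(String[] args) {
>             SolrDocument doc = new SolrDocument(); // no "boost" field set
>             // getFieldValue returns null for a missing field; the cast to
>             // Float succeeds (null is a valid Float reference), but
>             // auto-unboxing null into a primitive float throws
>             // java.lang.NullPointerException.
>             float boost = (Float) doc.getFieldValue("boost");
>             System.out.println(boost);
>         }
>     }
> 
> Running this against any document that lacks a stored boost value throws
> the same NPE as in your stack trace below.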
> 
> 
> Best Regards
> Alexander Aristov
> 
> On 4 November 2010 21:49, Markus Jelsma <[email protected]> wrote:
> > Found the problem. The boost field was removed, but it seems the dedup
> > job needs it. I haven't tested it, but since I recently removed the
> > field it makes sense.
> > 
> > Why would I need the boost field anyway, and why does the dedup job
> > need it?
> > 
> > > Hi all,
> > > 
> > > For some reason I get an exception on this job. I don't know since
> > > when, but something is going wrong somewhere. It's against Solr 1.4.1.
> > > 
> > > Here's Solr's log output, just before the exception:
> > > 
> > > Nov 4, 2010 2:18:42 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=1} hits=12197 status=0 QTime=1
> > > Nov 4, 2010 2:18:44 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=1} hits=12197 status=0 QTime=1
> > > Nov 4, 2010 2:18:45 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=12197&version=1} hits=12197 status=0 QTime=98
> > > Nov 4, 2010 2:19:00 PM org.apache.solr.core.SolrCore execute
> > > INFO: [core_name] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
> > > Nov 4, 2010 2:19:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
> > > INFO: Slave in sync with master.
> > > 
> > > Then Nutch's hadoop.log output:
> > > 
> > > 2010-11-04 14:18:35,952 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2010-11-04 14:18:35
> > > 2010-11-04 14:18:35,952 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/fcgroningen_master
> > > 2010-11-04 14:18:59,652 WARN  mapred.LocalJobRunner - job_local_0001
> > > java.lang.NullPointerException
> > >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrRecord.readSolrDocument(SolrDeleteDuplicates.java:129)
> > >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
> > >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:240)
> > >         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> > >         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> > >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > >         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > > 
> > > Anyone got an idea?
> > > 
> > > Cheers,
