I just found out that this was logged by Markus many moons ago:
https://issues.apache.org/jira/browse/NUTCH-992
It would be nice if you could update that Jira issue with any progress you
are able to make on it. I am not able to help right now, sorry.
Lewis


On Fri, Apr 26, 2013 at 2:14 PM, brian4 <[email protected]> wrote:

> I have Nutch 2.1 with HBase 0.90.6 and Solr 3.6 and have been stepping
> through the basic crawl process for just one cycle. After many hours of
> searching the web I finally got it to crawl and index my first webpage,
> but I am stuck on the de-duplication step and am hoping someone can
> help.
>
> I ran each command in the following sequence and everything went fine:
> I checked a readdb dump at each step to see the changes, and I was able
> to find the indexed page in the Solr admin UI. (Note: I could not get
> this to work at all when using the "-crawlId" option as in the example
> crawl script in 2.x; no URLs would be processed in that case, and I was
> getting the common "batch id doesn't match" error, but that's a
> separate issue I'll deal with next.)
>
> $bin/nutch inject $URLDIR
> $bin/nutch generate
> $bin/nutch fetch
> $bin/nutch parse -all
> $bin/nutch updatedb
> $bin/nutch solrindex $SOLRURL -reindex
>
> Output of the last step:
> SolrIndexerJob: starting
> Adding 1 documents
> SolrIndexerJob: done.
>
> (I used -reindex because plain indexing initially failed with a
> different error, which I fixed by removing the id field in
> solrindex-mapping.xml; since the page already carried an index mark, I
> wanted to be sure it would still be indexed.)
>
>
> However, when I tried:
>
> $bin/nutch solrdedup $SOLRURL
>
> I got the following error in the command window:
> Exception in thread "Main Thread" java.lang.NullPointerException
>         at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
>         at org.apache.hadoop.mapreduce.split.JobSplitWriter.writeNewSplits(JobSplitWriter.java:123)
>         at org.apache.hadoop.mapreduce.split.JobSplitWriter.createSplitFiles(JobSplitWriter.java:74)
>         at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:968)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
>         at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:371)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:382)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:393)
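>
> From stepping through the Hadoop 1.0.x source (this is just my reading
> of it, so take it with a grain of salt), the line the trace points at
> dereferences the result of getSerialization(), which returns null when
> none of the serializations listed in "io.serializations" accepts the
> class being written (in this job, the input split class):
>
> // org.apache.hadoop.io.serializer.SerializationFactory, roughly as it
> // appears in Hadoop 1.0.x (comments mine):
> public <T> Serializer<T> getSerializer(Class<T> c) {
>   // getSerialization(c) can return null, so this line can throw NPE
>   return getSerialization(c).getSerializer(c);
> }
>
> public <T> Serialization<T> getSerialization(Class<T> c) {
>   // walks the io.serializations list; returns null if nothing accepts c
>   for (Serialization serialization : serializations) {
>     if (serialization.accept(c)) {
>       return (Serialization<T>) serialization;
>     }
>   }
>   return null;
> }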
>
>
> And in my hadoop.log file the only lines added were:
>
> 2013-04-26 16:36:42,784 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
> 2013-04-26 16:36:42,793 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: [$SOLRURL]
> 2013-04-26 16:36:43,089 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-04-26 16:36:43,129 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
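>
> So my guess (and it is only a guess) is that no configured serializer
> accepts the dedup job's input split class, either because the split is
> not seen as a Writable or because "io.serializations" got clobbered
> somewhere in my setup; the "No job jar file set" warning also makes me
> wonder whether the job is picking up the right configuration at all.
> For reference, a split class that the default WritableSerialization
> would accept looks roughly like the sketch below (class and field names
> are illustrative, not Nutch's actual code):
>
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.mapreduce.InputSplit;
>
> // Minimal new-API split that WritableSerialization will accept,
> // because it implements Writable (illustrative only).
> public class ExampleSplit extends InputSplit implements Writable {
>   private int docBegin;  // index of the first doc in this split
>   private int numDocs;   // how many docs this split covers
>
>   public ExampleSplit() {}  // Writable needs a no-arg constructor
>
>   public ExampleSplit(int docBegin, int numDocs) {
>     this.docBegin = docBegin;
>     this.numDocs = numDocs;
>   }
>
>   @Override
>   public void write(DataOutput out) throws IOException {
>     out.writeInt(docBegin);
>     out.writeInt(numDocs);
>   }
>
>   @Override
>   public void readFields(DataInput in) throws IOException {
>     docBegin = in.readInt();
>     numDocs = in.readInt();
>   }
>
>   @Override
>   public long getLength() { return numDocs; }
>
>   @Override
>   public String[] getLocations() { return new String[0]; }
> }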
>
>
>
> Does anyone have any idea how to fix this?  I would greatly appreciate any
> help!
>
>
> In case it helps:
>
> - plugin.includes in my nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>subcollection|protocol-httpclient|urlfilter-regex|parse-(html|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> - I also have the following analyzer definition in schema.xml:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
> <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0" />
> </types>



-- 
*Lewis*
