I have Nutch 2.1 with HBase 0.90.6 and Solr 3.6 and have been stepping
through the basic crawl process for a single cycle.  After many hours of
searching the web I finally got it to crawl and index my first web page,
but I am stuck on the de-duplication step and am hoping someone can help.

I ran each command in the following sequence and everything went fine.  I
checked a readdb dump at each step to see the changes and was able to find
the indexed page in the Solr admin UI.  (Note: I could not get this to work
at all when using the "-crawlId" option as in the 2.x example crawl script -
no URLs would be processed in that case and I kept getting the common "batch
id doesn't match" error, but that's a separate issue I'll deal with next.)

$ bin/nutch inject $URLDIR
$ bin/nutch generate
$ bin/nutch fetch
$ bin/nutch parse -all
$ bin/nutch updatedb
$ bin/nutch solrindex $SOLRURL -reindex
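For reference, the sequence above can be sketched as a single script.  The
NUTCH_HOME, URLDIR, and SOLRURL defaults below are placeholder assumptions
(not my actual paths), and the helper only prints each command as a dry run
instead of executing it:

```shell
# Dry-run sketch of the single-cycle crawl sequence above.
# NUTCH_HOME, URLDIR, and SOLRURL defaults are placeholders.
NUTCH="${NUTCH_HOME:-.}/bin/nutch"
URLDIR="${URLDIR:-urls}"
SOLRURL="${SOLRURL:-http://localhost:8983/solr}"

nutch_step() {
  # Dry run: print the command instead of executing it.
  # Swap `echo` for `"$NUTCH"` to run the steps for real.
  echo "$NUTCH $*"
}

CYCLE_LOG="$(
  nutch_step inject "$URLDIR"
  nutch_step generate
  nutch_step fetch
  nutch_step parse -all
  nutch_step updatedb
  nutch_step solrindex "$SOLRURL" -reindex
)"
printf '%s\n' "$CYCLE_LOG"
```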

Output of the last step:
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.

(I used -reindex because indexing initially failed with a different error,
which I fixed by removing the id field in solrindex-mapping.xml; the page
therefore already had an index mark, and I wanted to be sure it would still
be indexed.)


However, when I tried:

$ bin/nutch solrdedup $SOLRURL

I got the following error in the command window:
Exception in thread "Main Thread" java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
        at org.apache.hadoop.mapreduce.split.JobSplitWriter.writeNewSplits(JobSplitWriter.java:123)
        at org.apache.hadoop.mapreduce.split.JobSplitWriter.createSplitFiles(JobSplitWriter.java:74)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:968)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:371)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:382)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:393)


And in my hadoop.log file the only lines added were:

2013-04-26 16:36:42,784 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
2013-04-26 16:36:42,793 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: [$SOLRURL]
2013-04-26 16:36:43,089 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-26 16:36:43,129 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).



Does anyone have any idea how to fix this?  I would greatly appreciate any
help!


In case it helps:

- plugins in my nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>subcollection|protocol-httpclient|urlfilter-regex|parse-(html|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

- I also have the following analyzer definition in schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
</types>
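One thing I noticed while reading about solrdedup: my understanding is that
SolrDeleteDuplicates queries the id, boost, tstamp, and digest fields from
Solr, so all four need to exist in the index.  The stock Nutch schema.xml
defines them roughly as below - the exact attributes are an assumption on my
part and may differ by version, so treat this as a sketch rather than my
actual config:

```xml
<!-- Sketch of the fields solrdedup reads, as in the stock Nutch schema.xml;
     attribute details here are an assumption, not my actual config. -->
<field name="id" type="string" stored="true" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="true"/>
<field name="boost" type="float" stored="true" indexed="true"/>
<field name="tstamp" type="date" stored="true" indexed="true"/>
```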









--
View this message in context: 
http://lucene.472066.n3.nabble.com/solrdedup-NullPointerException-tp4059389.html
Sent from the Nutch - User mailing list archive at Nabble.com.
