I have Nutch 2.1 with HBase 0.90.6 and Solr 3.6 and have been stepping
through the basic crawl process for just one cycle. After many hours of
searching the web I finally got it to crawl and index my first webpage,
but now I am stuck on the de-duplication step and am hoping someone can
help.
I ran each command in the sequence below and everything went fine; I
checked a readdb dump after each step to see the changes, and I could
find the indexed page in the Solr admin UI. (Note: I could not get this
to work at all when using the "-crawlId" option as the example crawl
script in 2.x does; no URLs were processed in that case and I kept
hitting the common "batch id doesn't match" error, but that's a separate
issue I'll deal with next.)
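For reference, the batch-oriented variant I was attempting looked
roughly like this ("mycrawl" is just a placeholder crawl id, and
generate prints the batch id that the later steps expect; my exact
invocations may have differed slightly):
$bin/nutch inject $URLDIR -crawlId mycrawl
$bin/nutch generate -crawlId mycrawl   (prints the generated batch id)
$bin/nutch fetch <batchId> -crawlId mycrawl
$bin/nutch parse <batchId> -crawlId mycrawl
$bin/nutch updatedb -crawlId mycrawl
$bin/nutch solrindex $SOLRURL <batchId> -crawlId mycrawl
Anyway, the plain sequence that did work: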
$bin/nutch inject $URLDIR
$bin/nutch generate
$bin/nutch fetch
$bin/nutch parse -all
$bin/nutch updatedb
$bin/nutch solrindex $SOLRURL -reindex
>> Output of the last step:
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.
(I used -reindex because plain indexing initially gave a different
error, which I fixed by removing the id field in solrindex-mapping.xml;
since the page already carried an index mark, I wanted to be sure it
would still get indexed.)
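For completeness, the line I removed from conf/solrindex-mapping.xml
was, as best I recall, the one that mapped a source field onto id,
something like:
<field dest="id" source="url"/>
(I left the <uniqueKey>id</uniqueKey> entry in place.)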
However, when I tried:
$bin/nutch solrdedup $SOLRURL
I get the following error in the command window:
Exception in thread "Main Thread" java.lang.NullPointerException
    at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
    at org.apache.hadoop.mapreduce.split.JobSplitWriter.writeNewSplits(JobSplitWriter.java:123)
    at org.apache.hadoop.mapreduce.split.JobSplitWriter.createSplitFiles(JobSplitWriter.java:74)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:968)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:371)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:382)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:393)
And in my hadoop.log file the only lines added were:
2013-04-26 16:36:42,784 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
2013-04-26 16:36:42,793 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: [$SOLRURL]
2013-04-26 16:36:43,089 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-26 16:36:43,129 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Does anyone have any idea how to fix this? I would greatly appreciate any
help!
In case it helps:
- the plugins in my nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>subcollection|protocol-httpclient|urlfilter-regex|parse-(html|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
- I also have the following analyzer definition in schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc"
            mode="compose"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="date" class="solr.TrieDateField" omitNorms="true"
           precisionStep="6" positionIncrementGap="0"/>
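One more thought, in case it's relevant: as far as I can tell, solrdedup
queries Solr for the id, boost, tstamp, and digest fields of every
document, so all four need to be defined and stored in the schema. In
the stock Nutch schema.xml they are declared roughly like this (mine may
have drifted while I was editing):
<field name="id" type="string" stored="true" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="true"/>
<field name="boost" type="float" stored="true" indexed="true"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
Could a missing or unstored field there be behind the NPE?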