I just found out this was logged by Markus many moons ago:
https://issues.apache.org/jira/browse/NUTCH-992
It would be nice if you could update that Jira issue with any progress you are able to make on it. I am not able to help right now, sorry.

Lewis
On Fri, Apr 26, 2013 at 2:14 PM, brian4 <[email protected]> wrote:

> I have Nutch 2.1 with HBase 0.90.6 and Solr 3.6 and have been stepping
> through the basic crawl process for just one cycle. I finally got it to
> crawl and index my first webpage after many hours of searching the web,
> but I am stuck on the de-duplication step and am hoping someone can help.
>
> I did each command in the following sequence and everything went fine - I
> checked a readdb dump at each step to see the changes and was able to
> find the indexed page in the Solr admin. (Note: I could not get this to
> work at all when using the "-crawlId" option as the example crawl script
> in 2.x does - no URLs would be processed in that case and I was getting
> the common "batch id doesn't match" error, but that's a separate issue
> I'll deal with next.)
>
> $bin/nutch inject $URLDIR
> $bin/nutch generate
> $bin/nutch fetch
> $bin/nutch parse -all
> $bin/nutch updatedb
> $bin/nutch solrindex $SOLRURL -reindex
>
> Output of the last step:
>
> SolrIndexerJob: starting
> Adding 1 documents
> SolrIndexerJob: done.
>
> (I used -reindex because indexing gave a different error at first, which
> I fixed by removing the id field in solrindex-mapping.xml, so the page
> already had an index mark and I wanted to be sure it would still be
> indexed.)
>
> However, when I tried:
>
> $bin/nutch solrdedup $SOLRURL
>
> I got the following error in the command window:
>
> Exception in thread "Main Thread" java.lang.NullPointerException
>   at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
>   at org.apache.hadoop.mapreduce.split.JobSplitWriter.writeNewSplits(JobSplitWriter.java:123)
>   at org.apache.hadoop.mapreduce.split.JobSplitWriter.createSplitFiles(JobSplitWriter.java:74)
>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:968)
>   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
>   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:371)
>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:382)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:393)
>
> And in my hadoop.log file the only lines added were:
>
> 2013-04-26 16:36:42,784 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting...
> 2013-04-26 16:36:42,793 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: [$SOLRURL]
> 2013-04-26 16:36:43,089 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-04-26 16:36:43,129 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>
> Does anyone have any idea how to fix this? I would greatly appreciate
> any help!
> In case it helps:
>
> - plugins in my nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>subcollection|protocol-httpclient|urlfilter-regex|parse-(html|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> - I also have the following analyzer definition in schema.xml:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>   </analyzer>
> </fieldType>
> <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0" />
> </types>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solrdedup-NullPointerException-tp4059389.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
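A note on the stack trace above: in Hadoop 1.x, SerializationFactory.getSerializer() looks up a Serialization for the class it is handed and dereferences the result without a null check, so it throws exactly this NullPointerException whenever nothing registered under io.serializations accepts the class. JobSplitWriter.writeNewSplits() triggers the lookup while serializing the job's input splits, which points at the split class the solrdedup job submits; the Nutch-side specifics are what NUTCH-992 tracks. As illustration only (this is a hypothetical class, not Nutch's actual SolrDeleteDuplicates split), a minimal split that the default WritableSerialization can serialize looks like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical split covering a contiguous range of Solr documents.
// JobSplitWriter serializes every split through SerializationFactory;
// a split class that is not Writable (and is covered by no other entry
// in io.serializations) resolves to a null Serialization, producing the
// NullPointerException quoted above.
public class ExampleSolrSplit extends InputSplit implements Writable {

  private int docBegin;  // index of the first document in this split
  private int numDocs;   // number of documents in this split

  public ExampleSolrSplit() {
    // No-arg constructor required so Hadoop can instantiate the split
    // before calling readFields().
  }

  public ExampleSolrSplit(int docBegin, int numDocs) {
    this.docBegin = docBegin;
    this.numDocs = numDocs;
  }

  @Override
  public long getLength() throws IOException {
    return numDocs;
  }

  @Override
  public String[] getLocations() throws IOException {
    return new String[0];  // a remote Solr server has no data locality
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(docBegin);
    out.writeInt(numDocs);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    docBegin = in.readInt();
    numDocs = in.readInt();
  }
}

A cheap first check, since the same NPE also appears when the Configuration handed to the job carries an empty io.serializations: make sure that property has not been overridden in nutch-site.xml or your Hadoop config.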

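Until solrdedup itself works, duplicates can be pruned directly over HTTP with SolrJ as a stopgap. A rough sketch, assuming SolrJ/Solr 3.6 (HttpSolrServer), that documents still carry unique "id" and content-hash "digest" fields as in the stock Nutch schema, and a placeholder Solr URL; it keeps the first document of each duplicate group and deletes the rest:

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.common.SolrDocument;

// Hypothetical stopgap: facet on the digest field to find values shared
// by more than one document, then delete all but one document per value.
public class ManualSolrDedup {

  public static void main(String[] args) throws Exception {
    // Placeholder URL; substitute your own $SOLRURL.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    SolrQuery facetQuery = new SolrQuery("*:*");
    facetQuery.setRows(0);              // only the facet counts are needed
    facetQuery.setFacet(true);
    facetQuery.addFacetField("digest");
    facetQuery.setFacetMinCount(2);     // digests with at least one duplicate
    facetQuery.setFacetLimit(-1);       // no cap on the number of facet values

    FacetField digests = solr.query(facetQuery).getFacetField("digest");
    if (digests == null || digests.getValues() == null) {
      return;                           // nothing duplicated
    }

    for (FacetField.Count c : digests.getValues()) {
      // Fetch every document sharing this digest value.
      SolrQuery dupQuery = new SolrQuery("digest:\"" + c.getName() + "\"");
      dupQuery.setRows((int) c.getCount());
      List<SolrDocument> dups = solr.query(dupQuery).getResults();

      // Keep the first document, delete the others by unique key.
      for (int i = 1; i < dups.size(); i++) {
        solr.deleteById((String) dups.get(i).getFieldValue("id"));
      }
    }
    solr.commit();
  }
}

Try it against a throwaway core first: this sketch makes no attempt to pick the best duplicate to keep, it simply keeps whichever document Solr returns first.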
