Hello,
I am new to Nutch and have been trying desperately to get a basic web
crawler going using the following packages:

HBase 0.90.4
Nutch 2.2.1
Solr 4.3.0

I have HBase running and can execute commands from the terminal. I also have
Solr running and have used the schema-solr4.xml that ships with Nutch 2.2.1
as the schema.xml file under the conf folder of collection1 in Solr. I even
added the "_version_" field that is missing from the schema-solr4.xml
example. I am having trouble getting Nutch to work, though. I can
successfully inject URLs, but the Hadoop log then shows a
NullPointerException raised from org.apache.avro.util.Utf8, inside
GeneratorReducer.setup.

Here are the contents of nutch-site.xml:

<configuration>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>SwirlCrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  </description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>SwirlCrawler</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
<property>
  <name>plugin.folders</name>
  <value>/bin/apache-nutch-2.2.1/runtime/local/plugins</value>
</property>
</configuration>

And here are the contents of hadoop.log:

2013-07-23 13:07:19,615 INFO  crawl.InjectorJob - InjectorJob: Using class
org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2013-07-23 13:07:19,662 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2013-07-23 13:07:19,882 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2013-07-23 13:07:20,546 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-07-23 13:07:20,739 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2013-07-23 13:07:20,988 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-07-23 13:07:20,999 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2013-07-23 13:07:21,052 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2013-07-23 13:07:21,280 INFO  crawl.InjectorJob - InjectorJob: total number
of urls rejected by filters: 0
2013-07-23 13:07:21,280 INFO  crawl.InjectorJob - InjectorJob: total number
of urls injected after normalization and filtering: 3
2013-07-23 13:07:21,287 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-23 13:07:21,287 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-07-23 13:07:21,287 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-07-23 13:07:21,935 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2013-07-23 13:07:22,063 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-23 13:07:22,064 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-07-23 13:07:22,064 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-07-23 13:07:22,126 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'generate_host_count', using default
2013-07-23 13:07:22,258 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-07-23 13:07:22,272 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2013-07-23 13:07:22,273 WARN  mapred.LocalJobRunner - job_local117641048_0002
java.lang.NullPointerException
        at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
        at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

Any help would be appreciated.

cheers,
BD



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-trying-to-run-Nutch-tp4079866.html
Sent from the Nutch - User mailing list archive at Nabble.com.
