Hello,
I am new to Nutch and have been trying desperately to get a basic web
crawler going using the following packages:
HBase 0.90.4
Nutch 2.2.1
Solr 4.3.0
I have HBase running and can execute commands via the terminal. I also have
Solr running, and I copied the schema-solr4.xml that ships with Nutch 2.2.1
over schema.xml in the conf folder of Solr's collection1. I also added the
"_version_" field, which is missing from the schema-solr4.xml example.
I am having trouble, though, getting Nutch to work. I can successfully
inject URLs, but the Hadoop log then shows a NullPointerException thrown
from org.apache.avro.util.Utf8, called from GeneratorReducer.setup.
Here are the contents of nutch-site.xml:
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>SwirlCrawler</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>SwirlCrawler</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>plugin.folders</name>
<value>/bin/apache-nutch-2.2.1/runtime/local/plugins</value>
</property>
</configuration>
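For anyone reproducing this setup, a quick sanity check is to confirm that the file is well-formed and that the properties listed above are the ones actually present. The following is only a minimal sketch: it assumes the real file has a single <configuration> root element, and read_properties and SAMPLE are hypothetical names used for illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> under <configuration>."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "configuration":
        raise ValueError("expected <configuration> root, got <%s>" % root.tag)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


# Hypothetical trimmed-down sample mirroring two of the properties quoted above.
SAMPLE = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>SwirlCrawler</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name, value in read_properties(SAMPLE).items():
        print(name, "=", value)
```

To check the real file, pass its text to read_properties instead of SAMPLE. This only verifies the XML itself; it says nothing about which values Nutch picks up at runtime.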
And here are the contents of hadoop.log:
2013-07-23 13:07:19,615 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2013-07-23 13:07:19,662 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-07-23 13:07:19,882 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-07-23 13:07:20,546 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-23 13:07:20,739 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-23 13:07:20,988 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-23 13:07:20,999 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-23 13:07:21,052 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-23 13:07:21,280 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2013-07-23 13:07:21,280 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 3
2013-07-23 13:07:21,287 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-23 13:07:21,287 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-07-23 13:07:21,287 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-23 13:07:21,935 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-07-23 13:07:22,063 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-23 13:07:22,064 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-07-23 13:07:22,064 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-23 13:07:22,126 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-07-23 13:07:22,258 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-23 13:07:22,272 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-23 13:07:22,273 WARN mapred.LocalJobRunner - job_local117641048_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
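Since most of the log above is routine INFO output, it can help to pull out only the entries that carry an exception. The sketch below is one possible way to do that, not a Nutch or Hadoop tool: it assumes each entry starts with a "date time,millis LEVEL logger -" prefix as in the log quoted here, and the names ENTRY, EXC, parse_entries and entries_with_exceptions are hypothetical.

```python
import re

# Start of an entry, e.g. "2013-07-23 13:07:22,273 WARN mapred.LocalJobRunner - ..."
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+(\S+) - ?(.*)$"
)
# A line whose first token ends in "Exception" or "Error", e.g. a Java class name.
EXC = re.compile(r"^\S*(?:Exception|Error)\b")


def parse_entries(lines):
    """Group lines into entries; lines without a timestamp continue the previous one."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        m = ENTRY.match(line)
        if m:
            ts, level, logger, msg = m.groups()
            entries.append(
                {"time": ts, "level": level, "logger": logger, "lines": [msg]}
            )
        elif entries:
            entries[-1]["lines"].append(line)
    return entries


def entries_with_exceptions(entries):
    """Keep only entries where some line looks like an exception class name."""
    return [e for e in entries if any(EXC.match(l.strip()) for l in e["lines"])]


# Lines taken from the log quoted above.
SAMPLE = """\
2013-07-23 13:07:22,272 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-23 13:07:22,273 WARN mapred.LocalJobRunner - job_local117641048_0002
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
"""

if __name__ == "__main__":
    for entry in entries_with_exceptions(parse_entries(SAMPLE.splitlines())):
        print(entry["time"], entry["level"], entry["logger"])
        for text in entry["lines"]:
            print("   ", text)
```

For a real file, pass the open file object to parse_entries in place of SAMPLE.splitlines(). The script only locates the failing entry; it does not explain why the exception was thrown.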
Please help?
cheers,
BD
--
View this message in context:
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-trying-to-run-Nutch-tp4079866.html
Sent from the Nutch - User mailing list archive at Nabble.com.