Re: Null Pointer Exception trying to run Nutch

Rui Gao Fri, 26 Jul 2013 06:38:44 -0700

Hi

I have the same software configuration and get the same error under Cygwin + 
Windows XP.


The same error occurs when using HBase 0.90.x. Here's the log:
2013-07-21 14:51:29,500 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 14:51:29,500 WARN  mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
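
The trace shows the exception being thrown inside the Utf8 constructor, called from GeneratorReducer.setup. One plausible reading, which is an assumption and not confirmed against the Nutch source, is that setup wraps a configuration value that was never set, so a null string reaches the constructor. Here is a minimal sketch of that failure mode. EncodedString is a hypothetical stand-in (not Avro's Utf8), and "example.batch.id" is a made-up key used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class NullConfigSketch {

    // Hypothetical stand-in for a string wrapper that encodes its argument
    // in the constructor. This is NOT Avro's Utf8 class.
    static final class EncodedString {
        private final byte[] bytes;

        EncodedString(String value) {
            // Throws NullPointerException when value is null.
            this.bytes = value.getBytes(StandardCharsets.UTF_8);
        }

        int length() {
            return bytes.length;
        }
    }

    // Returns true if wrapping the configured value fails with an NPE,
    // which happens when the key is absent and the lookup yields null.
    static boolean failsWithNpe(Map<String, String> conf, String key) {
        try {
            new EncodedString(conf.get(key));
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Key is missing: the lookup returns null and the wrapper throws.
        System.out.println(failsWithNpe(conf, "example.batch.id"));
        // Key is present: the wrapper is constructed normally.
        conf.put("example.batch.id", "some-batch-id");
        System.out.println(failsWithNpe(conf, "example.batch.id"));
    }
}
```

If this reading is right, the thing to look for is which configuration value is missing when the generate job's reducer starts, not a character-encoding problem in the crawled data.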






At 2013-07-24 04:20:15,band_master <[email protected]> wrote:
>Hello,
>I am new to Nutch and have been trying desperately to get a basic web
>crawler going using the following packages:
>
>HBase 0.90.4
>Nutch 2.2.1
>Solr 4.3.0
>
>I have Hbase running and can execute commands via terminal. I also have Solr
>running and have used the schema-solr4.xml that came with Nutch 2.2.1 in
>schema.xml file under the conf folder in collection1 of Solr. I even added
>the field for "_version_" that is missing from the schema-solr4.xml example.
>I am having trouble, though, getting Nutch to work. I can successfully
>inject urls, but there seems to be an error in the Hadoop log around parsing
>UTF8 characters. 
>
>Here are the contents of nutch-site.xml:
>
><property>
> <name>storage.data.store.class</name>
> <value>org.apache.gora.hbase.store.HBaseStore</value>
> <description>Default class for storing data</description>
></property>
><property>
>  <name>http.agent.name</name>
>  <value>SwirlCrawler</value>
>  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>  please set this to a single word uniquely related to your organization.
>  </description>
></property>
>
><property>
>  <name>http.robots.agents</name>
>  <value>SwirlCrawler</value>
>  <description>The agent strings we'll look for in robots.txt files,
>  comma-separated, in decreasing order of precedence. You should
>  put the value of http.agent.name as the first agent name, and keep the
>  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>  </description>
></property> 
><property>
>   <name>plugin.folders</name>
>   <value>/bin/apache-nutch-2.2.1/runtime/local/plugins</value>
> </property>
></configuration>
>
>and here are the contents of hadoop.log:
>
>2013-07-23 13:07:19,615 INFO  crawl.InjectorJob - InjectorJob: Using class
>org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
>2013-07-23 13:07:19,662 WARN  util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes where
>applicable
>2013-07-23 13:07:19,882 WARN  snappy.LoadSnappy - Snappy native library not
>loaded
>2013-07-23 13:07:20,546 INFO  mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>2013-07-23 13:07:20,739 INFO  regex.RegexURLNormalizer - can't find rules
>for scope 'inject', using default
>2013-07-23 13:07:20,988 INFO  mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>2013-07-23 13:07:20,999 INFO  regex.RegexURLNormalizer - can't find rules
>for scope 'inject', using default
>2013-07-23 13:07:21,052 WARN  mapred.FileOutputCommitter - Output path is
>null in cleanup
>2013-07-23 13:07:21,280 INFO  crawl.InjectorJob - InjectorJob: total number
>of urls rejected by filters: 0
>2013-07-23 13:07:21,280 INFO  crawl.InjectorJob - InjectorJob: total number
>of urls injected after normalization and filtering: 3
>2013-07-23 13:07:21,287 INFO  crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>2013-07-23 13:07:21,287 INFO  crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>2013-07-23 13:07:21,287 INFO  crawl.AbstractFetchSchedule -
>maxInterval=7776000
>2013-07-23 13:07:21,935 INFO  mapreduce.GoraRecordReader -
>gora.buffer.read.limit = 10000
>2013-07-23 13:07:22,063 INFO  crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>2013-07-23 13:07:22,064 INFO  crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>2013-07-23 13:07:22,064 INFO  crawl.AbstractFetchSchedule -
>maxInterval=7776000
>2013-07-23 13:07:22,126 INFO  regex.RegexURLNormalizer - can't find rules
>for scope 'generate_host_count', using default
>2013-07-23 13:07:22,258 INFO  mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>2013-07-23 13:07:22,272 WARN  mapred.FileOutputCommitter - Output path is
>null in cleanup
>2013-07-23 13:07:22,273 WARN  mapred.LocalJobRunner -
>job_local117641048_0002
>java.lang.NullPointerException
>       at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>       at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>       at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
>Please help?
>
>cheers,
>BD
>
>
>
>--
>View this message in context: 
>http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-trying-to-run-Nutch-tp4079866.html
>Sent from the Nutch - User mailing list archive at Nabble.com.
