Hi,
I have the same software configuration and get the same error under Cygwin +
Windows XP.
The error occurs when using HBase 0.90.x. Here's the log:
2013-07-21 14:51:29,500 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 14:51:29,500 WARN mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
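
The trace shows the NullPointerException being thrown inside the Utf8 constructor, called from GeneratorReducer.setup. That is what one would expect if a null String were handed to the constructor, for example because a configuration key was never set. I have not checked what GeneratorReducer.java:100 actually reads, so the following is only a sketch of that failure mode. Utf8Like, the Map standing in for the job configuration, and the key "example.batch.id" are all hypothetical names, not Avro or Nutch APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class NullUtf8Sketch {

    // Hypothetical stand-in for a UTF-8 string wrapper; NOT Avro's Utf8 class.
    static final class Utf8Like {
        final byte[] bytes;

        Utf8Like(String s) {
            // Calling a method on a null String throws NullPointerException.
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Returns true if wrapping the value stored under `key` fails with an NPE.
    // The Map is an assumed stand-in for a job configuration object.
    static boolean failsWithNpe(Map<String, String> conf, String key) {
        try {
            new Utf8Like(conf.get(key)); // conf.get returns null for a missing key
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Key never set: the lookup yields null and the constructor throws.
        System.out.println("missing key fails: " + failsWithNpe(conf, "example.batch.id"));

        // Key set: the same code path succeeds.
        conf.put("example.batch.id", "1374580039-12345");
        System.out.println("present key fails: " + failsWithNpe(conf, "example.batch.id"));
    }
}
```

If this is indeed what happens, the thing to look for is which value the reducer expects in its configuration and why it is absent when the job runs under LocalJobRunner.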
At 2013-07-24 04:20:15, band_master <[email protected]> wrote:
>Hello,
>I am new to Nutch and have been trying desperately to get a basic web
>crawler going using the following packages:
>
>HBase 0.90.4
>Nutch 2.2.1
>Solr 4.3.0
>
>I have HBase running and can execute commands via the terminal. I also have
>Solr running and have used the schema-solr4.xml that came with Nutch 2.2.1 as
>the schema.xml file under the conf folder in collection1 of Solr. I even added
>the field for "_version_" that is missing from the schema-solr4.xml example.
>I am having trouble, though, getting Nutch to work. I can successfully
>inject URLs, but there seems to be an error in the Hadoop log around parsing
>UTF-8 characters.
>
>Here are the contents of nutch-site.xml:
>
><property>
> <name>storage.data.store.class</name>
> <value>org.apache.gora.hbase.store.HBaseStore</value>
> <description>Default class for storing data</description>
></property>
><property>
> <name>http.agent.name</name>
> <value>SwirlCrawler</value>
> <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> please set this to a single word uniquely related to your organization.
> </description>
></property>
>
><property>
> <name>http.robots.agents</name>
> <value>SwirlCrawler</value>
> <description>The agent strings we'll look for in robots.txt files,
> comma-separated, in decreasing order of precedence. You should
> put the value of http.agent.name as the first agent name, and keep the
> default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> </description>
></property>
><property>
> <name>plugin.folders</name>
> <value>/bin/apache-nutch-2.2.1/runtime/local/plugins</value>
></property>
></configuration>
>
>and here are the contents of hadoop.log:
>
>2013-07-23 13:07:19,615 INFO crawl.InjectorJob - InjectorJob: Using class
>org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
>2013-07-23 13:07:19,662 WARN util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes where
>applicable
>2013-07-23 13:07:19,882 WARN snappy.LoadSnappy - Snappy native library not
>loaded
>2013-07-23 13:07:20,546 INFO mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>2013-07-23 13:07:20,739 INFO regex.RegexURLNormalizer - can't find rules
>for scope 'inject', using default
>2013-07-23 13:07:20,988 INFO mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>2013-07-23 13:07:20,999 INFO regex.RegexURLNormalizer - can't find rules
>for scope 'inject', using default
>2013-07-23 13:07:21,052 WARN mapred.FileOutputCommitter - Output path is
>null in cleanup
>2013-07-23 13:07:21,280 INFO crawl.InjectorJob - InjectorJob: total number
>of urls rejected by filters: 0
>2013-07-23 13:07:21,280 INFO crawl.InjectorJob - InjectorJob: total number
>of urls injected after normalization and filtering: 3
>2013-07-23 13:07:21,287 INFO crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>2013-07-23 13:07:21,287 INFO crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>2013-07-23 13:07:21,287 INFO crawl.AbstractFetchSchedule -
>maxInterval=7776000
>2013-07-23 13:07:21,935 INFO mapreduce.GoraRecordReader -
>gora.buffer.read.limit = 10000
>2013-07-23 13:07:22,063 INFO crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>2013-07-23 13:07:22,064 INFO crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>2013-07-23 13:07:22,064 INFO crawl.AbstractFetchSchedule -
>maxInterval=7776000
>2013-07-23 13:07:22,126 INFO regex.RegexURLNormalizer - can't find rules
>for scope 'generate_host_count', using default
>2013-07-23 13:07:22,258 INFO mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>2013-07-23 13:07:22,272 WARN mapred.FileOutputCommitter - Output path is
>null in cleanup
>2013-07-23 13:07:22,273 WARN mapred.LocalJobRunner -
>job_local117641048_0002
>java.lang.NullPointerException
> at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
> at
> org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
> at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
>Please help?
>
>cheers,
>BD
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-trying-to-run-Nutch-tp4079866.html
>Sent from the Nutch - User mailing list archive at Nabble.com.