I think I've figured out the problem here. The issue was with an older
version of the Elastic MapReduce tools that Amazon provided, which was
buggy and used an older version of Hadoop (0.18). Running the same
task with Hadoop 0.20 seems to have worked fine - I'm still running
some tests.
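
For reference, this is roughly how I pinned the Hadoop version when
creating the job flow with the elastic-mapreduce CLI (flag names from
memory, so double-check them against your client version):

  elastic-mapreduce --create --alive \
    --name "nutch-fetch" \
    --hadoop-version 0.20 \
    --num-instances 4 --instance-type m1.large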

Is there a minimum required Hadoop version for Nutch 1.2?

Also - is there a way to download the Nutch 1.3RC source from somewhere?
The trunk contains 2.0 at the moment.
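
(I would have expected the usual ASF branches layout, i.e. something
like the following - the branch name here is a guess on my part, so a
listing first would confirm it:

  svn ls http://svn.apache.org/repos/asf/nutch/branches/
  svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3

but I couldn't tell which branch or tag corresponds to the RC.)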

Thanks!
Viksit

On Tue, May 31, 2011 at 1:13 AM, Julien Nioche
<[email protected]> wrote:
>>
>> It appeared to me that the NPE happens during the serialization into
>> the crawldb - but I haven't been able to figure out why this happens.
>>
>
> The Fetcher does not write to the crawldb at all - that happens during the
> 'update' step. As you can see from the stack trace, the problem arises when
> the reducer deserializes its input before processing it.
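>
> To make that concrete, here is a rough sketch (simplified, not the
> actual Hadoop source) of the reduce-side deserialization that runs
> before your reducer sees each value:
>
>   import java.io.DataInput;
>   import java.io.IOException;
>   import org.apache.nutch.crawl.CrawlDatum;
>
>   class ValueIteratorSketch {
>     CrawlDatum readNextValue(DataInput in) throws IOException {
>       CrawlDatum datum = new CrawlDatum();
>       // readFields() delegates to MapWritable.readFields() for the
>       // metadata map; MapWritable resolves a one-byte class id to a
>       // Class and instantiates it via ReflectionUtils.newInstance().
>       // A null result from that id->class lookup produces exactly
>       // the NPE in your trace (ConcurrentHashMap.get rejects a null
>       // key inside newInstance's constructor cache).
>       datum.readFields(in);
>       return datum;
>     }
>   }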
>
> Maybe you could try running the code on the same data with a debugger to get
> a clearer picture of what's happening?
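>
> For what it's worth, one way to do that on a cluster is to make the
> child task JVM wait for a remote debugger (a sketch, assuming the
> Hadoop 0.20 property name; you could equally set this in
> mapred-site.xml):
>
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class DebugChildJvm {
>     public static void main(String[] args) {
>       JobConf job = new JobConf();
>       // suspend=y blocks each task JVM until a debugger attaches on
>       // port 8000 - run a single reduce task, or the ports will clash.
>       job.set("mapred.child.java.opts",
>           "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");
>       // ...configure and submit the fetch job as usual...
>     }
>   }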
>
>
>
> On 31 May 2011 07:56, Viksit Gaur <[email protected]> wrote:
>
>> [Cross-posting to user and dev since this is a possible bug]
>>
>> Hi all,
>>
>> Running the Nutch 1.2 Fetcher on an Amazon EMR cluster results in an error
>> of the following sort:
>>
>>
>> 2011-05-31 05:55:52,858 WARN org.apache.hadoop.mapred.TaskTracker
>> (main): Error running child
>> java.lang.RuntimeException: java.lang.NullPointerException
>>        at
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
>>        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:163)
>>        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>>        at
>> org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>>        at
>> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>        at
>> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>        at
>> org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:770)
>>        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:710)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:228)
>>        at
>> org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:321)
>>        at
>> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)
>> Caused by: java.lang.NullPointerException
>>        at
>> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>        at
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:72)
>>        ... 11 more
>>
>> and happens during the Fetcher's reduce step:
>>
>> 2011-05-31 05:56:38,078 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 66%
>> 2011-05-31 05:56:38,078 INFO org.apache.hadoop.mapred.JobClient
>> (main): Task Id : attempt_201105310525_0005_r_000000_1, Status :
>> FAILED
>> 2011-05-31 05:56:48,178 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 90%
>> 2011-05-31 05:56:53,230 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 91%
>> 2011-05-31 05:56:58,251 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 94%
>> 2011-05-31 05:57:03,271 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 96%
>> 2011-05-31 05:57:08,307 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 97%
>> 2011-05-31 05:57:13,343 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 99%
>> 2011-05-31 05:57:17,360 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 66%
>> 2011-05-31 05:57:17,360 INFO org.apache.hadoop.mapred.JobClient
>> (main): Task Id : attempt_201105310525_0005_r_000000_2, Status :
>> FAILED
>> 2011-05-31 05:57:27,440 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 90%
>> 2011-05-31 05:57:32,460 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 91%
>> 2011-05-31 05:57:33,465 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 92%
>> 2011-05-31 05:57:37,497 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 94%
>> 2011-05-31 05:57:42,517 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 96%
>> 2011-05-31 05:57:47,537 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 97%
>> 2011-05-31 05:57:48,542 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 98%
>> 2011-05-31 05:57:52,558 INFO org.apache.hadoop.mapred.JobClient
>> (main):  map 100% reduce 99%
>> 2011-05-31 05:57:56,576 ERROR org.apache.nutch.fetcher.Fetcher (main):
>> Fetcher: java.io.IOException: Job failed!
>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
>>        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>> It appeared to me that the NPE happens during the serialization into
>> the crawldb - but I haven't been able to figure out why this happens.
>> Would anyone have ideas on this?
>>
>> Cheers
>> Viksit
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
