Julien and Markus,

The logs show that a couple of threads hung while processing certain
URLs. Below that is the out-of-memory WARN entry.
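The SocketTimeoutException in the first trace is the JVM's socket read timeout firing. As a minimal standalone sketch (not Nutch code), here is how a blocking read against a server that accepts a connection but never sends data times out, the same way an unresponsive page stalls a fetcher thread:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws IOException {
        // A local server that accepts connections but never writes,
        // standing in for an unresponsive web server.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            // Read timeout in milliseconds; Nutch's http.timeout plays
            // this role for the fetcher's HTTP reads.
            client.setSoTimeout(500);
            try {
                client.getInputStream().read(); // blocks until the timeout fires
                System.out.println("read returned");
            } catch (SocketTimeoutException e) {
                // The same exception class seen in the fetcher log.
                System.out.println("Read timed out");
            }
        }
    }
}
```

Without such a timeout the read would block indefinitely; with it, the thread gets control back and can record the URL as failed instead of hanging.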
2014-07-14 16:22:02,209 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:293)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:221)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:183)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:715)
2014-07-14 16:22:43,138 INFO fetcher.Fetcher - fetch of http://myurl.com failed with: java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1282)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1199)
    at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
    at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
    at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
    at org.apache.hadoop.io.Text.write(Text.java:281)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1066)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:982)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:929)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:784)
2014-07-14 16:22:43,138 ERROR fetcher.Fetcher - fetcher caught: java.lang.NullPointerException
2014-07-14 16:22:43,139 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-07-14 16:22:43,139 WARN mapred.LocalJobRunner - job_local551121011_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: Java heap space
2014-07-14 16:22:43,558 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
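On the original question of skipping unresponsive URLs rather than crashing: one avenue worth trying is tightening the fetcher's HTTP timeout in nutch-site.xml. A sketch, assuming a standard Nutch 1.x configuration (the value shown is illustrative, not a recommended default):

```xml
<!-- nutch-site.xml: overrides nutch-default.xml -->
<property>
  <name>http.timeout</name>
  <!-- Milliseconds to wait on an HTTP read before the fetcher gives
       up on the URL and moves on; lower values skip dead pages faster. -->
  <value>10000</value>
</property>
```

The timeout only makes the fetcher abandon the URL, though; the job itself died with an OutOfMemoryError, and since it ran under LocalJobRunner (i.e., in the crawl JVM itself), the heap to raise is that of the launching JVM, e.g. via the NUTCH_HEAPSIZE environment variable honored by the bin/nutch script.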
On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada <[email protected]> wrote:
> All,
>
> I am coming across a few pages that are not responsive at all, which is
> causing Nutch to #failwhale before finishing the current crawl. I have
> increased http.timeout and it still crashes. How can I get Nutch to
> skip over unresponsive URLs that are causing the entire crawl to bail?
>
> Thanks,
> Adam