Julien and Markus,

The logs show that a couple of threads hung while processing certain
URLs. Below that is the out-of-memory WARN entry.
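The SocketTimeoutException in the first trace is the JVM's socket read timeout firing. As a minimal standalone sketch (not Nutch code), here is how a blocking read against a server that accepts a connection but never sends data times out, the same way an unresponsive page stalls a fetcher thread:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws IOException {
        // A local server that accepts connections but never writes,
        // standing in for an unresponsive web server.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            // Read timeout in milliseconds; Nutch's http.timeout plays
            // this role for the fetcher's HTTP reads.
            client.setSoTimeout(500);
            try {
                client.getInputStream().read(); // blocks until the timeout fires
                System.out.println("read returned");
            } catch (SocketTimeoutException e) {
                // The same exception class seen in the fetcher log.
                System.out.println("Read timed out");
            }
        }
    }
}
```

Without such a timeout the read would block indefinitely; with it, the thread gets control back and can record the URL as failed instead of hanging.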
2014-07-14 16:22:02,209 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:293)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:221)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:183)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:715)
2014-07-14 16:22:43,138 INFO fetcher.Fetcher - fetch of http://myurl.com failed with: java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1282)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1199)
    at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
    at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
    at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
    at org.apache.hadoop.io.Text.write(Text.java:281)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1066)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:982)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:929)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:784)
2014-07-14 16:22:43,138 ERROR fetcher.Fetcher - fetcher caught: java.lang.NullPointerException
2014-07-14 16:22:43,139 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-07-14 16:22:43,139 WARN mapred.LocalJobRunner - job_local551121011_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: Java heap space
2014-07-14 16:22:43,558 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
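On the original question of skipping unresponsive URLs rather than crashing: one avenue worth trying is tightening the fetcher's HTTP timeout in nutch-site.xml. A sketch, assuming a standard Nutch 1.x configuration (the value shown is illustrative, not a recommended default):

```xml
<!-- nutch-site.xml: overrides nutch-default.xml -->
<property>
  <name>http.timeout</name>
  <!-- Milliseconds to wait on an HTTP read before the fetcher gives
       up on the URL and moves on; lower values skip dead pages faster. -->
  <value>10000</value>
</property>
```

The timeout only makes the fetcher abandon the URL, though; the job itself died with an OutOfMemoryError, and since it ran under LocalJobRunner (i.e., in the crawl JVM itself), the heap to raise is that of the launching JVM, e.g. via the NUTCH_HEAPSIZE environment variable honored by the bin/nutch script.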
On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada <[email protected]> wrote:
> All,
>
> I am coming across a few pages that are not responsive at all, which is
> causing Nutch to #failwhale before finishing the current crawl. I have
> increased http.timeout and it still crashes. How can I get Nutch to
> skip over unresponsive URLs that are causing the entire crawl to bail?
>
> Thanks,
> Adam