Hi Adam,

Your problem is the OutOfMemoryError, not the read timeouts. Timeouts on
individual URLs won't crash the Fetcher; they just fail those fetches.
How much memory do you give Nutch?
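
For reference, your log shows LocalJobRunner, so the whole job runs in a
single JVM whose heap comes from the NUTCH_HEAPSIZE variable read by the
stock bin/nutch launcher (default 1000 MB in Nutch 1.x). A rough sketch of
raising it — the 4000 MB figure and the fetch command are just examples,
adjust to your setup:

```shell
# bin/nutch reads NUTCH_HEAPSIZE (in MB) and passes it to the JVM as -Xmx;
# the 1000 MB default is easy to exhaust with many fetcher threads.
export NUTCH_HEAPSIZE=4000   # e.g. give the local job a 4 GB heap

# then re-run the fetch step as usual, for example:
# bin/nutch fetch crawl/segments/<segment>
```

If you later move to a real Hadoop cluster (deploy mode), NUTCH_HEAPSIZE
only affects the client JVM; the fetcher map tasks take their heap from
mapred.child.java.opts (e.g. -Xmx) in your Hadoop configuration instead.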

J.



On 17 July 2014 18:40, Adam Estrada <[email protected]> wrote:

> Julien and Markus,
>
> The logs report that a couple of threads hung while processing certain
> URLs. Below that was the out of memory WARNING.
>
> 2014-07-14 16:22:02,209 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
>         at java.io.FilterInputStream.read(FilterInputStream.java:107)
>         at
> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:293)
>         at
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:221)
>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
>         at
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:183)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:715)
>
> 2014-07-14 16:22:43,138 INFO  fetcher.Fetcher - fetch of
> http://myurl.com failed with: java.lang.NullPointerException
>         at java.lang.System.arraycopy(Native Method)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1282)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1199)
>         at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
>         at
> org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
>         at
> org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
>         at org.apache.hadoop.io.Text.write(Text.java:281)
>         at
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>         at
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1066)
>         at
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:982)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:929)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:784)
>
> 2014-07-14 16:22:43,138 ERROR fetcher.Fetcher - fetcher
> caught:java.lang.NullPointerException
> 2014-07-14 16:22:43,139 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-07-14 16:22:43,139 WARN  mapred.LocalJobRunner -
> job_local551121011_0001
> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 2014-07-14 16:22:43,558 ERROR fetcher.Fetcher - Fetcher:
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
>
> On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada <[email protected]>
> wrote:
> > All,
> >
> > I am coming across a few pages that are not responsive at all, which is
> > causing Nutch to #failwhale before finishing the current crawl. I have
> > increased http.timeout and it still crashes. How can I get Nutch to
> > skip over unresponsive URLs that are causing the entire thing to bail?
> >
> > Thanks,
> > Adam
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble