Hi Adam,

Your problem is the OutOfMemoryError, not the read timeouts. Timeouts on
individual URLs won't crash the Fetcher; they just fail those fetches.
How much memory do you give Nutch?
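
For reference, your log shows LocalJobRunner, so the whole job runs in a
single JVM whose heap comes from the NUTCH_HEAPSIZE variable read by the
stock bin/nutch launcher (default 1000 MB in Nutch 1.x). A rough sketch of
raising it — the 4000 MB figure and the fetch command are just examples,
adjust to your setup:

```shell
# bin/nutch reads NUTCH_HEAPSIZE (in MB) and passes it to the JVM as -Xmx;
# the 1000 MB default is easy to exhaust with many fetcher threads.
export NUTCH_HEAPSIZE=4000   # e.g. give the local job a 4 GB heap

# then re-run the fetch step as usual, for example:
# bin/nutch fetch crawl/segments/<segment>
```

If you later move to a real Hadoop cluster (deploy mode), NUTCH_HEAPSIZE
only affects the client JVM; the fetcher map tasks take their heap from
mapred.child.java.opts (e.g. -Xmx) in your Hadoop configuration instead.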

J.



On 17 July 2014 18:40, Adam Estrada <[email protected]> wrote:

> Julien and Markus,
>
> The logs report that a couple of threads hung while processing certain
> URLs. Below that was the out of memory WARNING.
>
> 2014-07-14 16:22:02,209 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
>         at java.io.FilterInputStream.read(FilterInputStream.java:107)
>         at
> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:293)
>         at
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:221)
>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
>         at
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:183)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:715)
>
> 2014-07-14 16:22:43,138 INFO  fetcher.Fetcher - fetch of
> http://myurl.com failed with: java.lang.NullPointerException
>         at java.lang.System.arraycopy(Native Method)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1282)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1199)
>         at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
>         at
> org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
>         at
> org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
>         at org.apache.hadoop.io.Text.write(Text.java:281)
>         at
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>         at
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1066)
>         at
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:982)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:929)
>         at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:784)
>
> 2014-07-14 16:22:43,138 ERROR fetcher.Fetcher - fetcher
> caught:java.lang.NullPointerException
> 2014-07-14 16:22:43,139 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-07-14 16:22:43,139 WARN  mapred.LocalJobRunner -
> job_local551121011_0001
> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 2014-07-14 16:22:43,558 ERROR fetcher.Fetcher - Fetcher:
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
>
> On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada <[email protected]>
> wrote:
> > All,
> >
> > I am coming across a few pages that are not responsive at all, which is
> > causing Nutch to #failwhale before finishing the current crawl. I have
> > increased http.timeout and it still crashes. How can I get Nutch to
> > skip over unresponsive URLs that are causing the entire thing to bail?
> >
> > Thanks,
> > Adam
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble