Hi Adam,

Your problem is the OutOfMemoryError, not the read timeouts. Read timeouts on their own won't crash the Fetcher. How much memory do you give Nutch?
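If you are running the crawl with the local job runner, the usual way to raise the heap is to set NUTCH_HEAPSIZE (in MB) before invoking the Nutch scripts — the value and segment path below are only examples, size them to your own crawl:

```shell
# Give the local Nutch JVM more heap before fetching.
# NUTCH_HEAPSIZE is read by bin/nutch and is expressed in MB;
# 4096 (4 GB) is an example value, not a recommendation.
export NUTCH_HEAPSIZE=4096

# Example invocation -- replace the segment path with your own.
bin/nutch fetch crawl/segments/20140714 -threads 10
```

If the job runs on a real Hadoop cluster instead, the child task heap is governed by mapred.child.java.opts (e.g. -Xmx) rather than NUTCH_HEAPSIZE.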
J.

On 17 July 2014 18:40, Adam Estrada <[email protected]> wrote:
> Julien and Markus,
>
> The logs report that a couple of threads hung while processing certain
> URLs. Below that was the out of memory WARNING.
>
> 2014-07-14 16:22:02,209 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:293)
>     at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:221)
>     at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:183)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:715)
>
> 2014-07-14 16:22:43,138 INFO fetcher.Fetcher - fetch of
> http://myurl.com failed with: java.lang.NullPointerException
>     at java.lang.System.arraycopy(Native Method)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1282)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1199)
>     at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
>     at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
>     at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
>     at org.apache.hadoop.io.Text.write(Text.java:281)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1066)
>     at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:982)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:929)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:784)
>
> 2014-07-14 16:22:43,138 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2014-07-14 16:22:43,139 INFO fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-07-14 16:22:43,139 WARN mapred.LocalJobRunner - job_local551121011_0001
> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 2014-07-14 16:22:43,558 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
>
> On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada <[email protected]> wrote:
> > All,
> >
> > I am coming across a few pages that are not responsive at all, which is
> > causing Nutch to #failwhale before finishing the current crawl. I have
> > increased http.timeout and it still crashes. How can I get Nutch to
> > skip over unresponsive URLs that are causing the entire thing to bail?
> >
> > Thanks,
> > Adam

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

