Hi,

I have been trying a few different Nutch parameter configurations to address a fetcher performance problem: throughput drops from 20+ URLs/second to less than 1 URL/second. To work around it, I set a value for fetcher.timelimit.mins so the fetcher terminates if it runs too long. In the case below, the fetcher process was started 12 hours earlier and should have terminated at about 2010-07-19 11:28.
At 11:28 the process shows 200 active threads and fetchQueues.totalSize=10000:

    2010-07-19 11:28:32,585 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=10000

From here the process appears to count fetchQueues.totalSize down from 10000 to 0. The queue size continues to decrease until, over 4 hours later, I get the following entries:

    2010-07-19 15:32:55,344 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=0
    2010-07-19 15:33:10,256 WARN fetcher.Fetcher - Aborting with 200 hung threads.

What is with the 200 hung threads? Where did they come from? Why are they hung?

The fetcher continues to run, and 50 minutes later it starts what appears to be another count-down:

    2010-07-19 16:18:25,446 INFO fetcher.Fetcher - QueueFeeder finished: total 277652 records + hit by time limit :6177960
    2010-07-19 16:18:25,473 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=199
    2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=198
    2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=197
    ...

It then stops at:

    2010-07-19 16:18:48,738 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=61

At this point it appears to run for another 2.5 hours before the next entries come up:

    2010-07-19 18:42:48,568 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
    ...
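For the record, that count-down really did take just over four hours; a quick sanity check on the two log timestamps above:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2010-07-19 11:28:32", FMT)  # totalSize=10000
end = datetime.strptime("2010-07-19 15:32:55", FMT)    # totalSize=0

# Elapsed time for the queue to drain from 10000 to 0
print(end - start)  # 4:04:23
```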
    2010-07-19 18:46:15,084 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)

Then it logs the following two items:

    2010-07-19 18:52:16,697 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
    2010-07-19 19:14:21,339 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default

An hour and a half later it comes up with the following error:

    2010-07-19 20:44:05,614 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
    2010-07-19 20:44:18,360 INFO fetcher.Fetcher - fetch of http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with: java.lang.NullPointerException
    2010-07-19 20:44:18,361 ERROR fetcher.Fetcher - java.lang.NullPointerException
    2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at java.lang.System.arraycopy(Native Method)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.Text.write(Text.java:281)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
    2010-07-19 20:44:25,597 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
    2010-07-19 20:44:25,597 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=60

It's now 21:55, and the line above is still the last line in the hadoop.log file. So, over 10 hours after fetcher.timelimit.mins was hit, the process has still not terminated and seems to be hanging on its threads. I'm not sure what should be happening here, and I don't want to kill the process and lose the work that has been done so far. This has happened in every case where I have put fetcher.timelimit.mins in place. If I don't use fetcher.timelimit.mins, I have to choose a relatively small topN (100k) to get any results in a 24-hour period.
The configuration changes I have are very basic:

    threads = 200
    topN not specified
    plugin.includes = protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
    fetcher.timelimit.mins = 1440 (12 hours)
    generate.max.count = 100
    fetcher.max.crawl.delay = 10
    db.fetch.retry.max = 2
    http.content.limit = 1024000
    http.timeout = 5000

I have varied threads and generate.max.count, but no matter what I choose, the process slows from 15+ URLs/second in the first couple of hours to less than 1 URL/second within 5-10 hours. That is why I implemented fetcher.timelimit.mins: to stop the process and start again, getting back to reasonable performance. But that appears to be a dead end, because I can't get the process to terminate. At this rate, the termination is going to take longer than the original fetch run.

On a side note, based on my testing I have a hunch that the issues may be coming from Tika. My original tests, which ran without these problems, did not use Tika in plugin.includes:

    protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

When I switched plugin.includes to

    protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

the problems started. I don't know if they are related, but it's a hunch.

Environment: Xeon X3220 @ 2.4 GHz, 8 GB RAM, about 1 TB of disk space, CentOS 5.5, on a 10 Mbps connection. Nutch/Solr/Tomcat are the only real things running on the box, and they are only running in support of Nutch.

Your help would be appreciated! Thanks,
Brad
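For reference, here is roughly how those overrides look in my nutch-site.xml. This is a reconstruction from the values listed above, not a verbatim copy of my file; in particular, I'm assuming "threads = 200" maps to the fetcher.threads.fetch property:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>200</value>
  </property>
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>1440</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>10</value>
  </property>
  <property>
    <name>db.fetch.retry.max</name>
    <value>2</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>1024000</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>5000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```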

