Hey Brad,

Did you try increasing http.content.limit? I think 10 MB-50 MB might be a (semi-)OK range. Others (Ken Krugler, you lurking? :) ) might have a better feel for the sweet spot for that parameter...
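For reference, that would go in conf/nutch-site.xml (which overrides nutch-default.xml); the 10 MB value below is just illustrative, not a recommendation:

```xml
<!-- http.content.limit caps how many bytes of each fetched document
     Nutch keeps; anything longer is truncated at this size.
     10485760 bytes = 10 MB; a negative value disables truncation. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
```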
Cheers,
Chris

On 7/20/10 11:41 AM, "brad" <[email protected]> wrote:

Here is some more information: as of 08:09 this morning, 7/20/2010, the process has only proceeded a little further; basically I think it is hung. The last 3 entries in the hadoop.log file are as follows:

2010-07-20 03:03:35,653 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
2010-07-20 03:03:35,653 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5b008f51
2010-07-20 03:03:35,653 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=58

I want to state up front that I am relatively new to Linux and a complete newbie to Nutch and Java, so please take what I say with a grain of salt.

I stumbled on Java's jstack and ran it against the currently running (apparently hung) Nutch process, which got me a full thread dump ("OpenJDK 64-Bit Server VM (1.6.0-b09 mixed mode)"). What I found is that there are 58 "FetcherThread" daemons running, every one of them at some point inside org.apache.nutch.parse.tika.TikaParser.getParse. The majority are in org.apache.tika.parser.video.FLVParser, but a lot are also in the java.util.Arrays.copyOf portion of tika.parser.txt:

at java.util.Arrays.copyOf(Arrays.java:2894)
...
at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
...
at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:144)

I have seen some issues reported on FLV, and I feel safe excluding that file type, but about the TXTParser I'm not so sure. At this point it appears that Tika is the issue, BUT I'm not sure Tika is fully to blame. I'm concerned that part of the problem is triggered by http.content.limit: it appears that Nutch will download a file only up to the size specified by http.content.limit.
But in the case of FLV, or PDF, or any other larger file, that will most likely result in an incomplete file that cannot be parsed correctly, which, if Tika does not handle it correctly, could result in the exception and then the hanging of the thread. For the threads that appear hung in the hadoop.log file, the file sizes exceed my http.content.limit, which means an incomplete file is being downloaded and Tika is attempting to parse it.

Is there a way to have Nutch bypass a file if it is too big, rather than download a truncated file that cannot be parsed correctly? Which leads me to the next question: is there a way for Nutch to get the size of a file before downloading it, and skip it if it is too large, rather than downloading and truncating it?

Thoughts? Options?

Thanks
Brad

> _____________________________________________
>
> Hi,
> I have been trying a few different configurations of Nutch parameters to
> try to improve fetcher performance, which drops from 20+ URLs/second to
> less than 1 URL/second. So I put in a value for fetcher.timelimit.mins to
> have it terminate if it runs too long. In this case I have a fetcher
> process, started 12 hours earlier, that should terminate at about
> 2010-07-19 11:28.
>
> At 11:28 the process shows 200 active threads and
> fetchQueues.totalSize=10000:
>
> 2010-07-19 11:28:32,585 INFO fetcher.Fetcher - -activeThreads=200,
> spinWaiting=0, fetchQueues.totalSize=10000
>
> From here the process appears to begin counting fetchQueues.totalSize
> down from 10000 to 0. The fetchQueues.totalSize continues to decrease
> until, over 4 hours later, I get the following entries:
>
> 2010-07-19 15:32:55,344 INFO fetcher.Fetcher - -activeThreads=200,
> spinWaiting=0, fetchQueues.totalSize=0
> 2010-07-19 15:33:10,256 WARN fetcher.Fetcher - Aborting with 200 hung
> threads.
>
> What is with the 200 hung threads? Where did they come from? Why are they
> hung?
>
> The fetcher continues to run, and then 50 minutes later it starts what
> appears to be another countdown:
>
> 2010-07-19 16:18:25,446 INFO fetcher.Fetcher - QueueFeeder finished:
> total 277652 records + hit by time limit :6177960
> 2010-07-19 16:18:25,473 INFO fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=199
> 2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=198
> 2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=197
> .
>
> It then stops at:
>
> 2010-07-19 16:18:48,738 INFO fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=61
>
> At this point it appears to run for another 2 and a half hours, then comes
> up with the next entries:
>
> 2010-07-19 18:42:48,568 INFO plugin.PluginRepository - Plugins: looking
> in: /usr/local/nutch/plugins
> .
> 2010-07-19 18:46:15,084 INFO plugin.PluginRepository - Ontology
> Model Loader (org.apache.nutch.ontology.Ontology)
>
> Then it logs the following 2 items:
>
> 2010-07-19 18:52:16,697 WARN regex.RegexURLNormalizer - can't find rules
> for scope 'outlink', using default
> 2010-07-19 19:14:21,339 WARN regex.RegexURLNormalizer - can't find rules
> for scope 'fetcher', using default
>
> An hour and a half later it comes up with the following error:
>
> 2010-07-19 20:44:05,614 WARN fetcher.Fetcher - Attempting to finish item
> from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> 2010-07-19 20:44:18,360 INFO fetcher.Fetcher - fetch of
> http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with:
> java.lang.NullPointerException
> 2010-07-19 20:44:18,361 ERROR fetcher.Fetcher -
> java.lang.NullPointerException
> 2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at
> java.lang.System.arraycopy(Native Method)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher -
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.io.Text.write(Text.java:281)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher
> caught:java.lang.NullPointerException
> 2010-07-19 20:44:25,597 WARN fetcher.Fetcher - Attempting to finish item
> from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> 2010-07-19 20:44:25,597 INFO fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=60
>
> It's now 21:55, and the line above is still the last line in the
> hadoop.log file.
>
> So, basically, over 10 hours after fetcher.timelimit.mins was hit, the
> process has still not terminated and seems to be hanging up on its
> threads.
>
> I'm not sure what should be happening here. I don't want to kill the
> process and lose the work that has been done to this point. This has
> happened in every case where I have put fetcher.timelimit.mins in place.
> If I don't put fetcher.timelimit.mins in place, I have to choose a
> relatively small topN (100k) to get any results in a 24-hour period.
>
> The configuration changes I have are very basic:
>
> threads = 200
> topN is not specified
> plugin.includes =
> protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> fetcher.timelimit.mins = 1440 (24 hours)
> generate.max.count = 100
> fetcher.max.crawl.delay = 10
> db.fetch.retry.max = 2
> http.content.limit = 1024000
> http.timeout = 5000
>
> I have varied threads and generate.max.count, but no matter what I
> choose, the process slows from 15+ URLs a second in the first couple of
> hours to less than 1 URL a second within 5-10 hours.
>
> That is why I implemented fetcher.timelimit.mins, in hopes of stopping
> the process and starting again to get back to reasonable performance. But
> that appears to be a dead end, because I can't get the process to
> terminate. At this rate, the termination is going to take longer than the
> original fetch run.
>
> On a side note, based on my testing I have a hunch that the issues may
> possibly be coming from Tika.
> My original tests, which ran without the same issues, did not use Tika
> in plugin.includes:
>
> protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> When I switched my plugin.includes to
>
> protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> the problems started. I don't know if they are related, but it's a hunch.
>
> Running on a Xeon X3220 @ 2.4 GHz, 8 GB RAM, about 1 TB of disk space,
> and CentOS 5.5, with a 10 Mbps connection. Nutch/Solr/Tomcat are the only
> real things running on the box, and they are only running in support of
> Nutch.
>
> Your help would be appreciated!
>
> Thanks
> Brad

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
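P.S. on Brad's question about skipping oversized files before downloading them: at the HTTP level, the usual trick is to issue a HEAD request and look at Content-Length before fetching the body. It is only best-effort (plenty of servers omit or misreport the header, and the extra round trip per URL costs time), and the sketch below is plain java.net code, not Nutch's protocol plugin API; the class and method names are made up for illustration.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SizeCheck {

    // True when a known Content-Length exceeds the limit.
    // -1 means the header was absent: size unknown, so don't skip.
    static boolean exceedsLimit(long contentLength, long limitBytes) {
        return contentLength >= 0 && contentLength > limitBytes;
    }

    // Issue a HEAD request and decide whether to skip the URL
    // instead of downloading a truncated copy that can't be parsed.
    static boolean shouldSkip(String url, long limitBytes) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            long length = conn.getHeaderFieldLong("Content-Length", -1);
            return exceedsLimit(length, limitBytes);
        } finally {
            conn.disconnect();
        }
    }
}
```

As far as I know this Nutch version has no built-in pre-fetch size check, so something like this would have to live in a custom protocol plugin or in a filtering pass over the fetch list.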

