There is also Andrzej's recent patch for the parse timeout, which prevents Tika from taking forever on some files. It has been mentioned on the list a couple of times.
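The idea behind such a parse timeout can be sketched independently of Nutch: run the parse call in a worker thread and stop waiting for it after a deadline. A minimal Python sketch of the technique follows (the actual patch is Java code inside Nutch; `parse_with_timeout` and both parsers here are illustrative names, not Nutch APIs):

```python
# Sketch of the parse-timeout idea (illustrative only -- the real patch
# lives inside Nutch and is written in Java): run the parser in a
# worker thread and give up waiting after a deadline.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def parse_with_timeout(parser, content, timeout_s):
    """Return parser(content), or None if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(parser, content)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # The worker thread is NOT killed; it keeps running in the
        # background. That is exactly why hung parser threads pile up
        # unless the parser itself is made interruptible.
        return None
    finally:
        pool.shutdown(wait=False)

def fast_parser(text):
    return text.upper()

def hung_parser(text):
    time.sleep(0.5)  # stands in for a parser stuck on a bad file
    return text

print(parse_with_timeout(fast_parser, "ok", timeout_s=1.0))   # prints OK
print(parse_with_timeout(hung_parser, "ok", timeout_s=0.05))  # prints None
```

Note the caveat in the comment: a timeout only stops the caller from waiting; the stuck worker thread lives on, which mirrors the hung FetcherThreads described below.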
On 20 July 2010 18:52, Mattmann, Chris A (388J) <[email protected]> wrote:

> Hey Brad,
>
> Did you try increasing http.content.limit? I think 10MB-50MB might be
> (semi-)OK parameters. Others (Ken Krugler, you lurking? :) ) might have a
> better feel for the sweet spot of that parameter...
>
> Cheers,
> Chris
>
>
> On 7/20/10 11:41 AM, "brad" <[email protected]> wrote:
>
> Here is some more information:
> As of 08:09 this morning, 7/20/2010, the process has only proceeded a
> little further; basically I think it is hung.
>
> The last 3 entries in the hadoop.log file are as follows:
>
> 2010-07-20 03:03:35,653 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2010-07-20 03:03:35,653 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5b008f51
> 2010-07-20 03:03:35,653 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=58
>
> I want to state up front that I am relatively new to Linux, and a complete
> newbie to Nutch and Java, so please take what I say with a grain of salt.
>
> I stumbled on Java's jstack and ran it against the currently running
> (apparently hung) Nutch process and got a full thread dump (OpenJDK 64-Bit
> Server VM, 1.6.0-b09 mixed mode).
>
> What I found was that there are 58 "FetcherThread" daemons running. Every
> one of them is at some point in org.apache.nutch.parse.tika.TikaParser.getParse.
>
> The majority are in org.apache.tika.parser.video.FLVParser.
>
> But there are also a lot in the java.util.Arrays.copyOf portion of
> tika.parser.txt:
>
> at java.util.Arrays.copyOf(Arrays.java:2894)
> ...
> at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> ...
> at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:144)
>
> I have seen some issues reported on FLV and I feel safe in excluding
> that file type.
> But the TXTParser I'm not so sure about.
>
> At this point it appears that Tika is the issue, BUT I'm not sure Tika is
> fully to blame. I'm concerned that part of the problem is triggered by
> issues with http.content.limit.
>
> It appears that Nutch will download a file up to the size specified by
> http.content.limit. But in the case of FLV, or PDF, or any other large
> file, that will most likely result in an incomplete file that cannot be
> parsed correctly, which, if Tika does not handle it correctly, could
> result in the exception and then the hanging of the thread. On the
> threads that appear hung in the hadoop.log file, the file sizes exceed my
> http.content.limit, which means an incomplete file is being downloaded
> and Tika is attempting to parse it.
>
> Is there a way to have Nutch bypass a file if it is too big, rather than
> download a truncated file that cannot be parsed correctly? Which leads me
> to the next question: is there a way for Nutch to get the size of a file
> before downloading it, and skip it if it is too large, rather than
> downloading it and truncating it?
>
> Thoughts? Options?
>
> Thanks
> Brad
>
> _____________________________________________
>
> > Hi,
> > I have been trying a few different configurations of Nutch parameters
> > to try to improve fetcher performance, which drops from 20+ URLs/second
> > to less than 1 URL/second. So I put in a value for fetcher.timelimit.mins
> > to have it terminate if it runs too long. In this case I have a fetcher
> > process started 12 hours earlier that should terminate at about
> > 2010-07-19 11:28.
> >
> > At 11:28 the process shows 200 active threads and fetchQueues.totalSize=10000:
> >
> > 2010-07-19 11:28:32,585 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=10000
> >
> > From here the process appears to begin a countdown of
> > fetchQueues.totalSize from 10000 to 0.
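[Editor's aside on the "skip it if it is too large" question above: HTTP does let a client learn the size before downloading, via the Content-Length header of a HEAD request, when the server provides it. A minimal sketch of the resulting decision, assuming the crawler already has the header value in hand (`should_fetch` is a hypothetical helper, not a Nutch API):]

```python
# Hypothetical helper (not a Nutch API): decide from the advertised
# Content-Length whether to fetch a URL at all, instead of downloading
# it and truncating it at http.content.limit. A crawler could obtain
# the header cheaply with an HTTP HEAD request before the real GET.
def should_fetch(content_length_header, http_content_limit):
    """content_length_header: the Content-Length response header value
    (a string, or None if the server did not send one)."""
    if content_length_header is None:
        return True  # size unknown: fetch and fall back on the limit
    try:
        size = int(content_length_header)
    except ValueError:
        return True  # unparseable header: treat as unknown
    return size <= http_content_limit

print(should_fetch("500000", 1024000))   # prints True  (small enough)
print(should_fetch("5000000", 1024000))  # prints False (would be truncated)
```

[Servers may omit or misreport Content-Length, so http.content.limit would still be needed as a backstop even with such a check.]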
> > The fetchQueues.totalSize continues to decrease until, over 4 hours
> > later, I get the following entries:
> >
> > 2010-07-19 15:32:55,344 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=0
> > 2010-07-19 15:33:10,256 WARN fetcher.Fetcher - Aborting with 200 hung threads.
> >
> > What is with the 200 hung threads? Where did they come from? Why are
> > they hung?
> >
> > The fetcher continues to run, and then 50 minutes later it starts what
> > appears to be another countdown:
> >
> > 2010-07-19 16:18:25,446 INFO fetcher.Fetcher - QueueFeeder finished: total 277652 records + hit by time limit :6177960
> > 2010-07-19 16:18:25,473 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=199
> > 2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=198
> > 2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=197
> > ...
> >
> > It then stops at:
> >
> > 2010-07-19 16:18:48,738 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=61
> >
> > At this point it appears to run for another 2 and a half hours and then
> > comes up with the next entry:
> >
> > 2010-07-19 18:42:48,568 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
> > ...
> > 2010-07-19 18:46:15,084 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> >
> > Then it does the following 2 items:
> >
> > 2010-07-19 18:52:16,697 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > 2010-07-19 19:14:21,339 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> >
> > An hour and a half later it comes up with the following error:
> >
> > 2010-07-19 20:44:05,614 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> > 2010-07-19 20:44:18,360 INFO fetcher.Fetcher - fetch of http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with: java.lang.NullPointerException
> > 2010-07-19 20:44:18,361 ERROR fetcher.Fetcher - java.lang.NullPointerException
> > 2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at java.lang.System.arraycopy(Native Method)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.Text.write(Text.java:281)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> > 2010-07-19 20:44:25,597 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> > 2010-07-19 20:44:25,597 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=60
> >
> > It's now 21:55 and the line above is still the last line in the
> > hadoop.log file.
> >
> > So basically, over 10 hours after the fetcher.timelimit.mins was hit,
> > the process has still not terminated and seems to be hanging on its
> > threads.
> >
> > I'm not sure what should be happening here. I don't want to kill the
> > process and lose the work that has been done at this point. This has
> > happened in every case where I have put fetcher.timelimit.mins in
> > place. If I don't put fetcher.timelimit.mins in place, I have to choose
> > a relatively small topN (100k) to get any results in a 24-hour period.
> > The configuration changes I have are very basic:
> >
> > Threads = 200
> > topN is not specified
> > plugin.includes = protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> > fetcher.timelimit.mins = 1440 (12 hours)
> > generate.max.count = 100
> > fetcher.max.crawl.delay = 10
> > db.fetch.retry.max = 2
> > http.content.limit = 1024000
> > http.timeout = 5000
> >
> > I have varied threads and generate.max.count, but no matter what I
> > choose, the process slows from 15+ URLs a second in the first couple of
> > hours to less than 1 URL a second within 5-10 hours.
> >
> > That is why I implemented fetcher.timelimit.mins, in hopes of stopping
> > the process and starting again to get back to reasonable performance.
> > But that appears to be a dead end, because I can't get the process to
> > terminate. At this rate, the termination is going to take longer than
> > the original fetch run time.
> >
> > On a side note, based on my testing I have a hunch that the issues may
> > possibly be coming from Tika. My original tests, which ran without the
> > same issues, did not use Tika in plugin.includes:
> >
> > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > When I switched my plugin.includes to
> >
> > protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > the problems started. I don't know if they are related, but it's a hunch.
> >
> > Running on a Xeon X3220 @ 2.4 GHz, 8 GB RAM, and about 1 TB of disk
> > space, with CentOS 5.5 and a 10 Mbps connection.
> > Nutch/Solr/Tomcat are the only real things running on the box, and
> > they are only running in support of Nutch.
> >
> > Your help would be appreciated!
> >
> > Thanks
> > Brad
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
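[Editor's note: the properties Brad lists are normally set in conf/nutch-site.xml, which overrides nutch-default.xml. As a sketch of the format only (values copied from his message, not a recommended tuning; descriptions paraphrased, not quoted from nutch-default.xml), two of them would look like:]

```xml
<!-- conf/nutch-site.xml fragment; values from the message above -->
<property>
  <name>http.content.limit</name>
  <value>1024000</value>
  <description>Downloaded content beyond this many bytes is truncated,
  which is what produces the incomplete files discussed above.</description>
</property>
<property>
  <name>fetcher.timelimit.mins</name>
  <value>1440</value>
  <description>Stop feeding the fetcher after this many minutes; hung
  parser threads can still keep the job alive past this limit.</description>
</property>
```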

