There is also Andrzej's recent patch for the parse timeout, which prevents Tika from taking forever on some files. It has been mentioned on the list a couple of times.
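The idea behind such a parse timeout can be sketched independently of Nutch: run the parse call in a worker thread and stop waiting for it after a deadline. A minimal Python sketch of the technique follows (the actual patch is Java code inside Nutch; `parse_with_timeout` and both parsers here are illustrative names, not Nutch APIs):

```python
# Sketch of the parse-timeout idea (illustrative only -- the real patch
# lives inside Nutch and is written in Java): run the parser in a
# worker thread and give up waiting after a deadline.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def parse_with_timeout(parser, content, timeout_s):
    """Return parser(content), or None if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(parser, content)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # The worker thread is NOT killed; it keeps running in the
        # background. That is exactly why hung parser threads pile up
        # unless the parser itself is made interruptible.
        return None
    finally:
        pool.shutdown(wait=False)

def fast_parser(text):
    return text.upper()

def hung_parser(text):
    time.sleep(0.5)  # stands in for a parser stuck on a bad file
    return text

print(parse_with_timeout(fast_parser, "ok", timeout_s=1.0))   # prints OK
print(parse_with_timeout(hung_parser, "ok", timeout_s=0.05))  # prints None
```

Note the caveat in the comment: a timeout only stops the caller from waiting; the stuck worker thread lives on, which mirrors the hung FetcherThreads described below.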
On 20 July 2010 18:52, Mattmann, Chris A (388J) <[email protected]> wrote:

> Hey Brad,
>
> Did you try increasing http.content.limit? I think 10MB-50MB might be
> (semi-)OK parameters. Others (Ken Krugler, you lurking? :) ) might have a
> better feel for the sweet spot of that parameter...
>
> Cheers,
> Chris
>
>
> On 7/20/10 11:41 AM, "brad" <[email protected]> wrote:
>
> Here is some more information:
> As of 08:09 this morning, 7/20/2010, the process has only proceeded a
> little further; basically I think it is hung.
>
> The last 3 entries in the hadoop.log file are as follows:
>
> 2010-07-20 03:03:35,653 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2010-07-20 03:03:35,653 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5b008f51
> 2010-07-20 03:03:35,653 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=58
>
> I want to state up front that I am relatively new to Linux, and a complete
> newbie to Nutch and Java, so please take what I say with a grain of salt.
>
> I stumbled on Java's jstack and ran it against the currently running
> (apparently hung) Nutch process and got a full thread dump (OpenJDK 64-Bit
> Server VM, 1.6.0-b09 mixed mode).
>
> What I found was that there are 58 "FetcherThread" daemons running. Every
> one of them is at some point in org.apache.nutch.parse.tika.TikaParser.getParse.
>
> The majority are in org.apache.tika.parser.video.FLVParser.
>
> But there are also a lot in the java.util.Arrays.copyOf portion of
> tika.parser.txt:
>
> at java.util.Arrays.copyOf(Arrays.java:2894)
> ...
> at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> ...
> at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:144)
>
> I have seen some issues reported on FLV and I feel safe in excluding
> that file type.
> But the TXTParser I'm not so sure about.
>
> At this point it appears that Tika is the issue, BUT I'm not sure Tika is
> fully to blame. I'm concerned that part of the problem is triggered by
> issues with http.content.limit.
>
> It appears that Nutch will download a file up to the size specified by
> http.content.limit. But in the case of FLV, or PDF, or any other large
> file, that will most likely result in an incomplete file that cannot be
> parsed correctly, which, if Tika does not handle it correctly, could
> result in the exception and then the hanging of the thread. On the
> threads that appear hung in the hadoop.log file, the file sizes exceed my
> http.content.limit, which means an incomplete file is being downloaded
> and Tika is attempting to parse it.
>
> Is there a way to have Nutch bypass a file if it is too big, rather than
> download a truncated file that cannot be parsed correctly? Which leads me
> to the next question: is there a way for Nutch to get the size of a file
> before downloading it, and skip it if it is too large, rather than
> downloading it and truncating it?
>
> Thoughts? Options?
>
> Thanks
> Brad
>
> _____________________________________________
>
> > Hi,
> > I have been trying a few different configurations of Nutch parameters
> > to try to improve fetcher performance, which drops from 20+ URLs/second
> > to less than 1 URL/second. So I put in a value for fetcher.timelimit.mins
> > to have it terminate if it runs too long. In this case I have a fetcher
> > process started 12 hours earlier that should terminate at about
> > 2010-07-19 11:28.
> >
> > At 11:28 the process shows 200 active threads and fetchQueues.totalSize=10000:
> >
> > 2010-07-19 11:28:32,585 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=10000
> >
> > From here the process appears to begin a countdown of
> > fetchQueues.totalSize from 10000 to 0.
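[Editor's aside on the "skip it if it is too large" question above: HTTP does let a client learn the size before downloading, via the Content-Length header of a HEAD request, when the server provides it. A minimal sketch of the resulting decision, assuming the crawler already has the header value in hand (`should_fetch` is a hypothetical helper, not a Nutch API):]

```python
# Hypothetical helper (not a Nutch API): decide from the advertised
# Content-Length whether to fetch a URL at all, instead of downloading
# it and truncating it at http.content.limit. A crawler could obtain
# the header cheaply with an HTTP HEAD request before the real GET.
def should_fetch(content_length_header, http_content_limit):
    """content_length_header: the Content-Length response header value
    (a string, or None if the server did not send one)."""
    if content_length_header is None:
        return True  # size unknown: fetch and fall back on the limit
    try:
        size = int(content_length_header)
    except ValueError:
        return True  # unparseable header: treat as unknown
    return size <= http_content_limit

print(should_fetch("500000", 1024000))   # prints True  (small enough)
print(should_fetch("5000000", 1024000))  # prints False (would be truncated)
```

[Servers may omit or misreport Content-Length, so http.content.limit would still be needed as a backstop even with such a check.]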
> > The fetchQueues.totalSize continues to decrease until, over 4 hours
> > later, I get the following entries:
> >
> > 2010-07-19 15:32:55,344 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=0
> > 2010-07-19 15:33:10,256 WARN fetcher.Fetcher - Aborting with 200 hung threads.
> >
> > What is with the 200 hung threads? Where did they come from? Why are
> > they hung?
> >
> > The fetcher continues to run, and then 50 minutes later it starts what
> > appears to be another countdown:
> >
> > 2010-07-19 16:18:25,446 INFO fetcher.Fetcher - QueueFeeder finished: total 277652 records + hit by time limit :6177960
> > 2010-07-19 16:18:25,473 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=199
> > 2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=198
> > 2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=197
> > ...
> >
> > It then stops at:
> >
> > 2010-07-19 16:18:48,738 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=61
> >
> > At this point it appears to run for another 2 and a half hours and then
> > comes up with the next entry:
> >
> > 2010-07-19 18:42:48,568 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
> > ...
> > 2010-07-19 18:46:15,084 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> >
> > Then it does the following 2 items:
> >
> > 2010-07-19 18:52:16,697 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > 2010-07-19 19:14:21,339 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> >
> > An hour and a half later it comes up with the following error:
> >
> > 2010-07-19 20:44:05,614 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> > 2010-07-19 20:44:18,360 INFO fetcher.Fetcher - fetch of http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with: java.lang.NullPointerException
> > 2010-07-19 20:44:18,361 ERROR fetcher.Fetcher - java.lang.NullPointerException
> > 2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at java.lang.System.arraycopy(Native Method)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.Text.write(Text.java:281)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> > 2010-07-19 20:44:25,597 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> > 2010-07-19 20:44:25,597 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=60
> >
> > It's now 21:55 and the line above is still the last line in the
> > hadoop.log file.
> >
> > So basically, over 10 hours after the fetcher.timelimit.mins was hit,
> > the process has still not terminated and seems to be hanging on its
> > threads.
> >
> > I'm not sure what should be happening here. I don't want to kill the
> > process and lose the work that has been done at this point. This has
> > happened in every case where I have put fetcher.timelimit.mins in
> > place. If I don't put fetcher.timelimit.mins in place, I have to choose
> > a relatively small topN (100k) to get any results in a 24-hour period.
> > The configuration changes I have are very basic:
> >
> > Threads = 200
> > topN is not specified
> > plugin.includes = protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> > fetcher.timelimit.mins = 1440 (12 hours)
> > generate.max.count = 100
> > fetcher.max.crawl.delay = 10
> > db.fetch.retry.max = 2
> > http.content.limit = 1024000
> > http.timeout = 5000
> >
> > I have varied threads and generate.max.count, but no matter what I
> > choose, the process slows from 15+ URLs a second in the first couple of
> > hours to less than 1 URL a second within 5-10 hours.
> >
> > That is why I implemented fetcher.timelimit.mins, in hopes of stopping
> > the process and starting again to get back to reasonable performance.
> > But that appears to be a dead end, because I can't get the process to
> > terminate. At this rate, the termination is going to take longer than
> > the original fetch run time.
> >
> > On a side note, based on my testing I have a hunch that the issues may
> > possibly be coming from Tika. My original tests, which ran without the
> > same issues, did not use Tika in plugin.includes:
> >
> > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > When I switched my plugin.includes to
> >
> > protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > the problems started. I don't know if they are related, but it's a hunch.
> >
> > Running on a Xeon X3220 @ 2.4 GHz, 8 GB RAM, and about 1 TB of disk
> > space, with CentOS 5.5 and a 10 Mbps connection.
> > Nutch/Solr/Tomcat are the only real things running on the box, and
> > they are only running in support of Nutch.
> >
> > Your help would be appreciated!
> >
> > Thanks
> > Brad
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
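[Editor's note: the properties Brad lists are normally set in conf/nutch-site.xml, which overrides nutch-default.xml. As a sketch of the format only (values copied from his message, not a recommended tuning; descriptions paraphrased, not quoted from nutch-default.xml), two of them would look like:]

```xml
<!-- conf/nutch-site.xml fragment; values from the message above -->
<property>
  <name>http.content.limit</name>
  <value>1024000</value>
  <description>Downloaded content beyond this many bytes is truncated,
  which is what produces the incomplete files discussed above.</description>
</property>
<property>
  <name>fetcher.timelimit.mins</name>
  <value>1440</value>
  <description>Stop feeding the fetcher after this many minutes; hung
  parser threads can still keep the job alive past this limit.</description>
</property>
```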

