I ran the Fetcher with the -noParsing flag and that worked perfectly:
it fetched over 1 million URLs in under 2 hours.

However, running the parse hangs:

  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1"
  $NUTCH_HOME/bin/nutch parse $segment -threads $threads $skipRecordsOptions

I have tried implementing NUTCH-696 (the parser timeout patch), but the
patch generates an error when applied to branch-1.1.
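
For what it's worth, once a parser-timeout patch along those lines is in
place, the limit is controlled through a nutch-site.xml property
(parser.timeout in later 1.x releases; treat the exact name and default
below as something to verify against your own tree):

```xml
<!-- Hypothetical until the patch applies cleanly to branch-1.1:
     caps the time spent parsing any single document. -->
<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Maximum number of seconds to spend parsing one
  document; -1 disables the timeout.</description>
</property>
```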




-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Tuesday, July 20, 2010 11:32 AM
To: [email protected]
Subject: Re: Nutch 1.1: Issue Using fetcher.timelimit.mins and fetch
performance

There is also Andrzej's recent patch for the parse timeout which prevents
Tika taking forever on some files. It has been mentioned on the list a
couple of times.

On 20 July 2010 18:52, Mattmann, Chris A (388J) <
[email protected]> wrote:

> Hey Brad,
>
> Did you try increasing http.content.limit? I think 10MB-50MB might be 
> (semi-)OK parameters. Others (Ken Krugler, you lurking? :) ) might 
> have some better feel of the sweet spot for that parameter...
>
> Cheers,
> Chris
>
>
> On 7/20/10 11:41 AM, "brad" <[email protected]> wrote:
>
> Here is some more information:
> As of 08:09 this morning (7/20/2010), the process has only proceeded
> a little further; basically, I think it is hung.
>
> The last 3 entries in the hadoop.log file are as follows:
> 2010-07-20 03:03:35,653 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2010-07-20 03:03:35,653 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5b008f51
> 2010-07-20 03:03:35,653 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=58
>
> I want to state up front that I am relatively new to Linux, and a
> complete newbie to Nutch and Java, so please take what I say with a
> grain of salt.
>
> I stumbled on Java's jstack and ran it against the currently running
> (apparently hung) Nutch process, and got a full thread dump (OpenJDK
> 64-Bit Server VM, 1.6.0-b09 mixed mode).
>
> What I found was that there are 58 "FetcherThread" daemons running.
> Every one of them is at some point in
> org.apache.nutch.parse.tika.TikaParser.getParse.
>
> The majority are in org.apache.tika.parser.video.FLVParser.
>
> But there are also a lot in the java.util.Arrays.copyOf portion of
> tika.parser.txt:
> at java.util.Arrays.copyOf(Arrays.java:2894)
> ...
> at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> ...
> at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:144)
>
>
> I have seen some issues reported with FLV files and I feel safe in
> excluding that file type.  But the TXTParser I'm not so sure about.
>
> At this point it appears that Tika is the issue, BUT I'm not sure Tika
> is fully to blame.  I'm concerned that part of the problem is triggered
> by http.content.limit.
> It appears that Nutch will download a file only up to the size
> specified by http.content.limit.  But in the case of FLV, PDF, or any
> other larger file, that will most likely result in an incomplete file
> that cannot be parsed correctly, which, if Tika does not handle it
> correctly, could result in the exception and then the hanging of the
> thread.  On the threads that appear hung in the hadoop.log file, the
> file sizes exceed my http.content.limit, which means an incomplete
> file is being downloaded and Tika is attempting to parse it.
>
> Is there a way to have Nutch bypass a file if it is too big, rather
> than download a truncated file that cannot be parsed correctly?  Which
> leads me to the next question: is there a way for Nutch to get the
> size of a file before downloading it, and skip it if it is too large,
> rather than downloading and truncating it?
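
One best-effort pre-check (not a built-in Nutch feature, and it assumes
the server actually reports Content-Length, which many do not) is to
issue an HTTP HEAD request and compare the reported size against the
limit before fetching.  A sketch in shell:

```shell
max_bytes=1024000   # mirrors the http.content.limit value in use

# Print the Content-Length value from raw HTTP response headers on
# stdin, or 0 if the header is absent.
content_length() {
  awk 'tolower($1) == "content-length:" { gsub(/\r/, ""); print $2; found = 1; exit }
       END { if (!found) print 0 }'
}

# Demo with a canned header block; a real check would pipe the output
# of `curl -sI "$url"` into content_length instead.
len=$(printf 'HTTP/1.1 200 OK\r\nContent-Length: 2048000\r\n\r\n' | content_length)
if [ "$len" -gt "$max_bytes" ]; then
  echo "skip: $len bytes exceeds $max_bytes"
fi
```

The extra HEAD round-trip per URL has its own cost at crawl scale, so
this is only a sketch of the idea, not a drop-in fix.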
>
>
> Thoughts? Options?
>
> Thanks
> Brad
>
>
>
>
>
>
>
> > _____________________________________________
> >
> > Hi,
> > I have been trying a few different configurations of Nutch
> > parameters to improve fetcher performance, which degrades from 20+
> > URLs/second to less than 1 URL/second.  So I put in a value for
> > fetcher.timelimit.mins to have it terminate if it runs too long.  In
> > this case I have a fetcher process, started 12 hours earlier, that
> > should terminate at about 2010-07-19 11:28
> >
> > @ 11:28 the process shows 200 active threads and 
> > fetchQueues.totalSize=10000
> > 2010-07-19 11:28:32,585 INFO  fetcher.Fetcher - -activeThreads=200, 
> > spinWaiting=0, fetchQueues.totalSize=10000
> >
> > From here the process appears to begin counting down
> > fetchQueues.totalSize from 10000 to 0.
> > The fetchQueues.totalSize continues to decrease until, over 4 hours
> > later, I get the following entries:
> >
> > 2010-07-19 15:32:55,344 INFO  fetcher.Fetcher - -activeThreads=200, 
> > spinWaiting=0, fetchQueues.totalSize=0
> > 2010-07-19 15:33:10,256 WARN  fetcher.Fetcher - Aborting with 200 
> > hung threads.
> >
> > What is with the 200 hung threads?  Where did they come from?  Why
> > are they hung?
> >
> > The fetcher continues to run, and then 50 minutes later it starts
> > what appears to be another countdown:
> > 2010-07-19 16:18:25,446 INFO  fetcher.Fetcher - QueueFeeder finished:
> > total 277652 records + hit by time limit :6177960
> > 2010-07-19 16:18:25,473 INFO  fetcher.Fetcher - -finishing thread 
> > FetcherThread, activeThreads=199
> > 2010-07-19 16:18:25,474 INFO  fetcher.Fetcher - -finishing thread 
> > FetcherThread, activeThreads=198
> > 2010-07-19 16:18:25,474 INFO  fetcher.Fetcher - -finishing thread 
> > FetcherThread, activeThreads=197
> >
> > .
> >
> > It then stops at:
> > 2010-07-19 16:18:48,738 INFO  fetcher.Fetcher - -finishing thread 
> > FetcherThread, activeThreads=61
> >
> > At this point it appears to run for another two and a half hours,
> > then comes up with the next entry:
> > 2010-07-19 18:42:48,568 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
> > .
> > 2010-07-19 18:46:15,084 INFO  plugin.PluginRepository -       Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> >
> > Then it does the following 2 items:
> > 2010-07-19 18:52:16,697 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > 2010-07-19 19:14:21,339 WARN  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> >
> > An hour and a half later it comes up with the following error:
> > 2010-07-19 20:44:05,614 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> > 2010-07-19 20:44:18,360 INFO  fetcher.Fetcher - fetch of http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with: java.lang.NullPointerException
> > 2010-07-19 20:44:18,361 ERROR fetcher.Fetcher - java.lang.NullPointerException
> > 2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at java.lang.System.arraycopy(Native Method)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.Text.write(Text.java:281)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
> > 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> > 2010-07-19 20:44:25,597 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> > 2010-07-19 20:44:25,597 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=60
> >
> > It's now 21:55 and the line above is still the last line in the
> > hadoop.log file.
> >
> > So basically, over 10 hours after fetcher.timelimit.mins was hit,
> > the process has still not terminated and seems to be hung on its
> > threads.
> >
> > I'm not sure what should be happening here.  I don't want to kill
> > the process and lose the work that has been done to this point.
> > This has happened in every case where I have put
> > fetcher.timelimit.mins in place.  If I don't put
> > fetcher.timelimit.mins in place, I have to choose a relatively small
> > topN (100k) to get any results in a 24-hour period.
> >
> > The configuration changes I have are very basic:
> > Threads = 200
> > topN is not specified
> > plugin.includes = protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> > fetcher.timelimit.mins = 1440 (12 hours)
> > generate.max.count = 100
> > fetcher.max.crawl.delay = 10
> > db.fetch.retry.max = 2
> > http.content.limit = 1024000
> > http.timeout = 5000
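
For reference, settings like those above live in conf/nutch-site.xml as
standard property entries; a sketch using two of the listed values:

```xml
<!-- Values copied from the configuration listed above. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>1440</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>1024000</value>
</property>
```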
> >
> > I have varied threads and generate.max.count, but no matter what I
> > choose, the process slows from 15+ URLs a second in the first couple
> > of hours to less than 1 URL a second within 5-10 hours.
> >
> > That is why I implemented fetcher.timelimit.mins, in hopes of
> > stopping the process and starting again to get back to reasonable
> > performance.  But that appears to be a dead end, because I can't get
> > the process to terminate.  At this rate, the termination is going to
> > take longer than the original fetch run time.
> >
> > On a side note, based on my testing I have a hunch that the issues
> > may be coming from Tika.  My original tests, which ran without these
> > issues, did not use Tika in plugin.includes:
> >
> >
> > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > When I switched my plugin.includes to
> >
> >
> > protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > The problems started.  I don't know if they are related, but it's a
> > hunch.
> >
> > Running on a Xeon X3220 @ 2.4 GHz, 8 GB RAM, about 1 TB of disk
> > space, and CentOS 5.5, with a 10 Mbps connection.  Nutch/Solr/Tomcat
> > are the only real things running on the box, and they are only
> > running in support of Nutch.
> >
> > Your help would be appreciated!
> >
> > Thanks
> > Brad
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department University of 
> Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com
