Hey Brad,

Did you try increasing http.content.limit? I think 10 MB-50 MB might be (semi-)OK
values. Others (Ken Krugler, you lurking? :) ) might have a better feel for
the sweet spot for that parameter...
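For reference, that property goes in conf/nutch-site.xml; a sketch with a
10 MB cap (the value is purely illustrative, not a recommendation) would be:

```xml
<!-- conf/nutch-site.xml: per-document download cap.
     10485760 bytes = 10 MB; illustrative value only. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
  <description>Maximum number of content bytes to download per document.
  Content beyond this limit is truncated; -1 disables the limit.</description>
</property>
```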

Cheers,
Chris


On 7/20/10 11:41 AM, "brad" <[email protected]> wrote:

Here is some more information:
As of 08:09 this morning, 7/20/2010, the process has only proceeded a little
further; basically I think it is hung.

The last 3 entries in the hadoop.log file are as follows:
2010-07-20 03:03:35,653 ERROR fetcher.Fetcher - fetcher
caught:java.lang.NullPointerException
2010-07-20 03:03:35,653 WARN  fetcher.Fetcher - Attempting to finish item
from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5b008f51
2010-07-20 03:03:35,653 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=58

I want to state up front that I am relatively new to Linux, and a complete
newbie to Nutch and Java so please take what I say with a grain of salt.

I stumbled on Java's jstack and ran it against the currently running
(apparently hung) Nutch process and got a full thread dump (OpenJDK 64-Bit
Server VM, 1.6.0-b09 mixed mode).
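As a side note for anyone reproducing this, the dump-and-count step can be
scripted. The `$dump` below is a fabricated two-thread stand-in for a real
`jstack` capture, so only the counting pipeline itself is being shown:

```shell
# Capturing a dump from the live JVM would be:
#   jstack <nutch-pid> > dump.txt        (pid from `jps -l`)
# $dump is a fabricated stand-in so the pipeline runs without a live JVM.
dump='"FetcherThread" daemon prio=10 tid=0x01 runnable
"FetcherThread" daemon prio=10 tid=0x02 runnable
"main" prio=10 tid=0x03 waiting on condition'

# Count threads named FetcherThread (Brad saw 58 of these)
fetchers=$(printf '%s\n' "$dump" | grep -c '^"FetcherThread"')
echo "$fetchers"
```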

What I found was that there are 58 "FetcherThread" daemons running.  Every one
of them is at some point in org.apache.nutch.parse.tika.TikaParser.getParse.

The majority are in org.apache.tika.parser.video.FLVParser.

But there are also a lot in the java.util.Arrays.copyOf portion of
tika.parser.txt:
at java.util.Arrays.copyOf(Arrays.java:2894)
...
at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
...
at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:144)


I have seen some issues reported with FLV, and I feel safe excluding that
file type.  But the TXTParser I'm not so sure about.
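If excluding FLV is the way to go, one option (assuming urlfilter-regex is
doing the filtering, which it is in my plugin.includes) would be a reject
rule in conf/regex-urlfilter.txt so those URLs are never fetched at all:

```
# conf/regex-urlfilter.txt -- skip URLs ending in .flv before fetching
-\.flv$
```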

At this point it appears that Tika is the issue, BUT I'm not sure Tika is
fully to blame.  I'm concerned that part of the problem is triggered by
http.content.limit.
It appears that Nutch will download a file up to the size specified by
http.content.limit.  But in the case of FLV, PDF, or any other large file,
that will most likely result in an incomplete file that cannot be parsed
correctly.  If Tika does not handle that correctly, it could result in the
exception and then the hanging of the thread.  For the threads that appear
hung in the hadoop.log file, the file sizes exceed my http.content.limit,
which means an incomplete file is being downloaded and Tika is attempting
to parse it.

Is there a way to have Nutch bypass a file if it is too big, rather than
download a truncated file that cannot be parsed correctly?  Which leads me
to the next question: is there a way for Nutch to get the size of a file
before downloading, and skip it if it is too large, rather than downloading
and truncating it?
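I don't know what Nutch itself supports here, but the idea can be sketched
with an HTTP HEAD request, which returns only the headers (including
Content-Length, when the server reports it) without the body. The header
block below is a made-up sample, not a real response:

```shell
# In practice the headers would come from something like:
#   curl -sI http://example.com/big.flv
# $headers is a fabricated sample response.
headers='HTTP/1.1 200 OK
Content-Type: video/x-flv
Content-Length: 10485760'

size=$(printf '%s\n' "$headers" | awk -F': ' '/^Content-Length/ {print $2}')
limit=1024000   # my http.content.limit

# Decide before downloading whether the file would be truncated
if [ "$size" -gt "$limit" ]; then
  echo "skip: $size bytes exceeds limit of $limit"
else
  echo "fetch: $size bytes fits under limit"
fi
```

Note that not every server sends Content-Length (chunked responses omit it),
so any real implementation would need a fallback for that case.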


Thoughts? Options?

Thanks
Brad


> _____________________________________________
>
> Hi,
> I have been trying a few different configurations of Nutch parameters to
> try to improve fetcher performance, which drops from 20+ URLs/second to less
> than 1 URL/second.  So I put in a value for fetcher.timelimit.mins to have
> it terminate if it runs too long.  In this case I have a fetcher process
> started 12 hours earlier that should terminate at about 2010-07-19 11:28
>
> @ 11:28 the process shows 200 active threads and
> fetchQueues.totalSize=10000
> 2010-07-19 11:28:32,585 INFO  fetcher.Fetcher - -activeThreads=200,
> spinWaiting=0, fetchQueues.totalSize=10000
>
> From here the process appears to begin a countdown of
> fetchQueues.totalSize from 10000 to 0.
> The fetchQueues.totalSize continues to decrease until, over 4 hours later,
> I get the following entries:
>
> 2010-07-19 15:32:55,344 INFO  fetcher.Fetcher - -activeThreads=200,
> spinWaiting=0, fetchQueues.totalSize=0
> 2010-07-19 15:33:10,256 WARN  fetcher.Fetcher - Aborting with 200 hung
> threads.
>
> What is with the 200 hung threads?  Where did they come from?  Why are they
> hung?
>
> The fetcher continues to run, and then 50 minutes later it starts what
> appears to be another countdown:
> 2010-07-19 16:18:25,446 INFO  fetcher.Fetcher - QueueFeeder finished:
> total 277652 records + hit by time limit :6177960
> 2010-07-19 16:18:25,473 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=199
> 2010-07-19 16:18:25,474 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=198
> 2010-07-19 16:18:25,474 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=197
>
> .
>
> It then stops at:
> 2010-07-19 16:18:48,738 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=61
>
> At this point it appears to run for another two and a half hours, then
> comes up with the next entry:
> 2010-07-19 18:42:48,568 INFO  plugin.PluginRepository - Plugins: looking
> in: /usr/local/nutch/plugins
> .
> 2010-07-19 18:46:15,084 INFO  plugin.PluginRepository -       Ontology
> Model Loader (org.apache.nutch.ontology.Ontology)
>
> Then it logs the following two items:
> 2010-07-19 18:52:16,697 WARN  regex.RegexURLNormalizer - can't find rules
> for scope 'outlink', using default
> 2010-07-19 19:14:21,339 WARN  regex.RegexURLNormalizer - can't find rules
> for scope 'fetcher', using default
>
> An hour and a half later it comes up with the following error:
> 2010-07-19 20:44:05,614 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> 2010-07-19 20:44:18,360 INFO  fetcher.Fetcher - fetch of http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with: java.lang.NullPointerException
> 2010-07-19 20:44:18,361 ERROR fetcher.Fetcher - java.lang.NullPointerException
> 2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at java.lang.System.arraycopy(Native Method)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.Text.write(Text.java:281)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
> 2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
> 2010-07-19 20:44:25,597 WARN  fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
> 2010-07-19 20:44:25,597 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=60
>
> It's now 21:55, and the line above is still the last line in the
> hadoop.log file.
>
> So basically, over 10 hours after fetcher.timelimit.mins was hit, the
> process has still not terminated and seems to be hung on its threads.
>
> I'm not sure what should be happening here.  I don't want to kill the
> process and lose the work that has been done to this point.  This has
> happened in every case where I have put fetcher.timelimit.mins in place.
> If I don't put fetcher.timelimit.mins in place, I have to choose a
> relatively small topN (100k) to get any results in a 24-hour period.
>
> The configuration changes I have are very basic:
> Threads = 200
> topN is not specified
> plugin.includes = protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> fetcher.timelimit.mins = 1440 (12 hours)
> generate.max.count = 100
> fetcher.max.crawl.delay = 10
> db.fetch.retry.max = 2
> http.content.limit = 1024000
> http.timeout = 5000
>
> I have varied threads and generate.max.count, but no matter what I choose,
> the process slows from 15+ URLs a second in the first couple of hours to
> less than 1 URL a second within 5-10 hours.
>
> That is why I implemented fetcher.timelimit.mins, in hopes of stopping the
> process and starting again to get back to reasonable performance.  But that
> appears to be a dead end, because I can't get the process to terminate.  At
> this rate, the termination is going to take longer than the original fetch
> run time.
>
> On a side note, based on my testing I have a hunch that the issues may be
> coming from Tika.  My original tests, which ran without the same issues,
> did not use tika in plugin.includes:
>
> protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> When I switched my plugin.includes to
>
> protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> The problems started.  I don't know if they are related, but it's a hunch.
>
> Running on a Xeon X3220 @ 2.4 GHz with 8 GB RAM, about 1 TB of disk space,
> CentOS 5.5, and a 10 Mbps connection.  Nutch/Solr/Tomcat are the only real
> things running on the box, and they are only running in support of Nutch.
>
> Your help would be appreciated!
>
> Thanks
> Brad
>



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
