Hi,

I have been trying a few different Nutch parameter configurations to address a fetcher performance problem: throughput drops from 20+ URLs/second to less than 1 URL/second. To work around it, I set a value for fetcher.timelimit.mins so the fetcher terminates if it runs too long. In the case below, the fetcher process was started 12 hours earlier and should have terminated at about 2010-07-19 11:28.
At 11:28 the process shows 200 active threads and fetchQueues.totalSize=10000:

    2010-07-19 11:28:32,585 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=10000

From here the process appears to count fetchQueues.totalSize down from 10000 to 0. The queue size continues to decrease until, over 4 hours later, I get the following entries:

    2010-07-19 15:32:55,344 INFO fetcher.Fetcher - -activeThreads=200, spinWaiting=0, fetchQueues.totalSize=0
    2010-07-19 15:33:10,256 WARN fetcher.Fetcher - Aborting with 200 hung threads.

What is with the 200 hung threads? Where did they come from? Why are they hung?

The fetcher continues to run, and 50 minutes later it starts what appears to be another count-down:

    2010-07-19 16:18:25,446 INFO fetcher.Fetcher - QueueFeeder finished: total 277652 records + hit by time limit :6177960
    2010-07-19 16:18:25,473 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=199
    2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=198
    2010-07-19 16:18:25,474 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=197
    ...

It then stops at:

    2010-07-19 16:18:48,738 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=61

At this point it appears to run for another 2.5 hours before the next entries come up:

    2010-07-19 18:42:48,568 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
    ...
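For the record, that count-down really did take just over four hours; a quick sanity check on the two log timestamps above:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2010-07-19 11:28:32", FMT)  # totalSize=10000
end = datetime.strptime("2010-07-19 15:32:55", FMT)    # totalSize=0

# Elapsed time for the queue to drain from 10000 to 0
print(end - start)  # 4:04:23
```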
    2010-07-19 18:46:15,084 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)

Then it logs the following two items:

    2010-07-19 18:52:16,697 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
    2010-07-19 19:14:21,339 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default

An hour and a half later it comes up with the following error:

    2010-07-19 20:44:05,614 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
    2010-07-19 20:44:18,360 INFO fetcher.Fetcher - fetch of http://www.ifunia.com/download/ifunia-avchd-converter.dmg failed with: java.lang.NullPointerException
    2010-07-19 20:44:18,361 ERROR fetcher.Fetcher - java.lang.NullPointerException
    2010-07-19 20:44:25,596 ERROR fetcher.Fetcher - at java.lang.System.arraycopy(Native Method)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1025)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.Text.write(Text.java:281)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:892)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:767)
    2010-07-19 20:44:25,597 ERROR fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
    2010-07-19 20:44:25,597 WARN fetcher.Fetcher - Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@5e8349a3
    2010-07-19 20:44:25,597 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=60

It's now 21:55, and the line above is still the last line in the hadoop.log file. So, over 10 hours after fetcher.timelimit.mins was hit, the process has still not terminated and seems to be hanging on its threads. I'm not sure what should be happening here, and I don't want to kill the process and lose the work that has been done so far. This has happened in every case where I have put fetcher.timelimit.mins in place. If I don't use fetcher.timelimit.mins, I have to choose a relatively small topN (100k) to get any results in a 24-hour period.
The configuration changes I have are very basic:

    threads = 200
    topN not specified
    plugin.includes = protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
    fetcher.timelimit.mins = 1440 (12 hours)
    generate.max.count = 100
    fetcher.max.crawl.delay = 10
    db.fetch.retry.max = 2
    http.content.limit = 1024000
    http.timeout = 5000

I have varied threads and generate.max.count, but no matter what I choose, the process slows from 15+ URLs/second in the first couple of hours to less than 1 URL/second within 5-10 hours. That is why I implemented fetcher.timelimit.mins: to stop the process and start again, getting back to reasonable performance. But that appears to be a dead end, because I can't get the process to terminate. At this rate, the termination is going to take longer than the original fetch run.

On a side note, based on my testing I have a hunch that the issues may be coming from Tika. My original tests, which ran without these problems, did not use Tika in plugin.includes:

    protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

When I switched plugin.includes to

    protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

the problems started. I don't know if they are related, but it's a hunch.

Environment: Xeon X3220 @ 2.4 GHz, 8 GB RAM, about 1 TB of disk space, CentOS 5.5, on a 10 Mbps connection. Nutch/Solr/Tomcat are the only real things running on the box, and they are only running in support of Nutch.

Your help would be appreciated! Thanks,
Brad
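For reference, here is roughly how those overrides look in my nutch-site.xml. This is a reconstruction from the values listed above, not a verbatim copy of my file; in particular, I'm assuming "threads = 200" maps to the fetcher.threads.fetch property:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>200</value>
  </property>
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>1440</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>10</value>
  </property>
  <property>
    <name>db.fetch.retry.max</name>
    <value>2</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>1024000</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>5000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```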

