On 2010-09-29 01:13, Steve Cohen wrote:
> fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
> fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
> fc12c078 * *org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18 (line 180)
> fc12c078 * *org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38 (line 234)
> fc12c078 * *org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50 (line 184)
> fc12c078 * *org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992 (line 555)
> fc16a680 * *org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V [compiled] +20 (line 226)
> fc16a680 * *org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120
This fragment of the stack trace suggests two things:
* You are running Fetcher in parsing mode. This is discouraged: if
parsing gets stuck or crashes, you will have to re-fetch from scratch.
* Regex URL filtering can be slow at times. There are many weird URLs
out there; my favorite one was 64 kB long and consisted partially of NULL
characters. Java regex may run VERY VERY slowly on such URLs, so slowly
that the task appears to hang, and sometimes the TaskTracker thinks it
really is hung and kills it. For large crawls I tend to avoid the regex
urlfilter and instead use a combination of prefix / suffix / domain /
custom filters that don't use regex, or I sanitize the URLs first.
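
To illustrate the "sanitize first" idea, here is a minimal sketch in plain Java. It is not Nutch code: the class name UrlPreFilter, the 2048-character limit and the method names are all assumptions made for this example. It rejects overlong URLs and URLs containing control characters (including NUL) with a cheap linear scan, so the regex never sees them.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: cheap checks that run before any regex.
final class UrlPreFilter {
    // Assumed limit for this sketch; tune it for your crawl.
    static final int MAX_URL_LENGTH = 2048;

    // Returns the URL unchanged if it looks safe to regex-match, or null to reject it.
    static String sanitize(String url) {
        if (url == null || url.isEmpty() || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            // Reject ASCII control characters, including NUL (0x00) and DEL (0x7f).
            if (c < 0x20 || c == 0x7f) {
                return null;
            }
        }
        return url;
    }

    // Runs the regex only on URLs that passed the cheap checks.
    static String filter(String url, Pattern allow) {
        String clean = sanitize(url);
        if (clean == null) {
            return null;
        }
        return allow.matcher(clean).find() ? clean : null;
    }
}
```

In a real crawl the same checks would have to run as a URL filter plugin ordered before the regex filter, or inside a custom filter; how to wire that up depends on your Nutch version.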
> I have a feeling I know why it is only using one core. I set
> mapred.tasktracker.reduce.tasks.maximum to 4, but I see that there is a
> setting for mapred.reduce.tasks which is set to 1. Do I need to up it to 4
> as well?
Yes. The first property specifies how many reduce tasks a tasktracker
can run at the same time, while the second sets the default number of
reduce tasks in a job (jobs may override this setting, but usually
don't, so this will usually be the number of reducers per job).
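
As a rough sketch, the two settings could look like this in the Hadoop configuration. The value 4 is the one from your message. Which file the properties belong in (mapred-site.xml on newer releases, hadoop-site.xml on older ones) is an assumption that depends on your Hadoop version.

```xml
<configuration>
  <!-- Per-tasktracker limit: how many reduce tasks one node may run at once. -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Per-job default: how many reduce tasks a job gets unless it overrides this. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
</configuration>
```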
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com