On 2010-09-29 01:13, Steve Cohen wrote:

  fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
  fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
  fc12c078 * *org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18 (line 180)
  fc12c078 * *org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38 (line 234)
  fc12c078 * *org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50 (line 184)
  fc12c078 * *org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992 (line 555)
  fc16a680 * *org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V [compiled] +20 (line 226)
  fc16a680 * *org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120

This fragment of the stacktrace suggests two things:

* You are running Fetcher in parsing mode. This is discouraged: if parsing hits a problem and gets stuck or crashes, you will have to re-fetch from scratch.

* Regex URL filtering can be slow at times. There are many weird URLs out there; my favorite one was 64 kB long and consisted partially of NULL characters. Java regex may run VERY slowly on such URLs, so slowly that the task appears to hang, and sometimes the TaskTracker decides it really is hung and kills it. For large crawls I tend to avoid the regex urlfilter and instead use a combination of prefix / suffix / domain / custom filters that don't use regex, or I sanitize the URLs first.
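
A minimal sketch of the "sanitize first, no regex" idea from the second point. This is not Nutch's own prefix/suffix filter code: the class name `UrlPreFilter`, the length limit, and the prefix and suffix lists are all hypothetical and chosen only for illustration. It returns the URL when accepted and null when rejected, which is an assumption modelled on the String-in, String-out `filter` frames in the trace above.

```java
import java.util.Locale;

final class UrlPreFilter {
    // Hypothetical limit; pick whatever suits your crawl.
    static final int MAX_URL_LENGTH = 2048;
    // Hypothetical lists, for illustration only.
    static final String[] ACCEPTED_PREFIXES = {"http://", "https://"};
    static final String[] REJECTED_SUFFIXES = {".jpg", ".gif", ".zip", ".exe"};

    // True if the URL contains NUL or any other ISO control character.
    static boolean hasControlChars(String url) {
        for (int i = 0; i < url.length(); i++) {
            if (Character.isISOControl(url.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Returns the URL if it passes all checks, or null to reject it.
    // Every check is a plain length or string comparison, so the cost
    // is linear in the URL length and no regex engine is involved.
    static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        if (hasControlChars(url)) {
            return null;
        }
        String lower = url.toLowerCase(Locale.ROOT);
        boolean prefixOk = false;
        for (String prefix : ACCEPTED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                prefixOk = true;
                break;
            }
        }
        if (!prefixOk) {
            return null;
        }
        for (String suffix : REJECTED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/page.html"));
        System.out.println(filter("http://example.com/a" + '\0' + "b"));
    }
}
```

Running cheap checks like these ahead of any regex rules means the pathological URLs (very long, or containing NULL characters) are dropped before a regex ever sees them.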

I have a feeling I know why it is only using one core. I set
mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a
setting for mapred.reduce.tasks which is set to 1. Do I need to up it to 4
as well?

Yes. The first property specifies how many reduce tasks a single tasktracker can run at the same time, while the second sets the default number of reduce tasks per job. Jobs may override this setting, but they usually don't, so it will usually be the number of reducers per job.
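
To see why both settings matter, here is a rough back-of-the-envelope sketch. It assumes a single tasktracker and a job that does not override the default; the class and method names are hypothetical and it does not call any Hadoop API. The number of reducers that can run at once is bounded by the smaller of the two values.

```java
final class ReduceSlots {
    // Upper bound on how many reduce tasks of one job can run at once.
    // jobReduceTasks        ~ mapred.reduce.tasks
    // reduceSlotsPerTracker ~ mapred.tasktracker.reduce.tasks.maximum
    static int concurrentReducers(int jobReduceTasks,
                                  int taskTrackers,
                                  int reduceSlotsPerTracker) {
        return Math.min(jobReduceTasks, taskTrackers * reduceSlotsPerTracker);
    }

    public static void main(String[] args) {
        // Settings from the question: 4 slots, but only 1 reducer per job.
        System.out.println(concurrentReducers(1, 1, 4)); // prints 1
        // After raising mapred.reduce.tasks to 4.
        System.out.println(concurrentReducers(4, 1, 4)); // prints 4
    }
}
```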


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
