Somehow the default configuration defined in nutch-default.xml is not
taken into account when you run the crawler on Hadoop.

A few things you can try:

1) Configure nutch-site.xml and provide the necessary properties there.
2) Also check the plugin.includes and plugin.folders properties;
plugin.folders should point to the correct plugins directory (give the
absolute path). See the sketch after this list.

Finally, keep all the Nutch configuration files in the $HADOOP_HOME/conf
folder, or add their location to the HADOOP_CLASSPATH variable in the
hadoop-env.sh file; a sketch of the latter follows.
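
For instance, hadoop-env.sh could export something like the line below;
the conf path is an assumption based on the /opt/nutch-2.0 install from
your command:

# In $HADOOP_HOME/conf/hadoop-env.sh: put the Nutch conf directory on the
# classpath so nutch-site.xml and the other config files are picked up
# (path assumed).
export HADOOP_CLASSPATH=/opt/nutch-2.0/runtime/local/conf:$HADOOP_CLASSPATH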

Regards,
Som

On Wed, Jul 18, 2012 at 7:43 AM, 许春玲 <[email protected]> wrote:

> Hi,
>
>     When I run the Nutch 2.0 crawler with the command:
> hadoop jar /opt/nutch-2.0/runtime/deploy/apache-nutch-2.0.job
> org.apache.nutch.crawl.Crawler urls -dir output00 -depth 3 -topN 5 -threads
> 80
>
> there is an error like:
>
> 12/07/18 09:13:32 INFO mapred.JobClient: Task Id :
> attempt_201207101015_0091_m_000000_2, Status : FAILED
> java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not
> found.
>
> But the URL regexes in conf/regex-urlfilter.txt are correct:
>
> +^http://([a-z0-9]*\.)*apache.org
> +^http://([a-z0-9]*\.)*sina.com.cn
>
> so, what should I do?
>
> Thanks.
>
> Ring
