Hi,
Additionally, the Nutch version is 2.1, and I have a ParseFilter that purges
the outlinks of the Parse object (via parse.setOutlinks(new Outlink[] {});).
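For context, the filter boils down to the sketch below. The class name is a
placeholder, the signatures follow the Nutch 2.x ParseFilter interface as I
understand it, and the plugin.xml wiring is omitted:

  import java.util.Collection;
  import java.util.Collections;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseFilter;
  import org.apache.nutch.storage.WebPage;
  import org.w3c.dom.DocumentFragment;

  // Placeholder name; the real plugin is registered through plugin.xml as usual.
  public class PurgeOutlinksFilter implements ParseFilter {

    private Configuration conf;

    @Override
    public Parse filter(String url, WebPage page, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      // Drop every outlink so nothing new should enter the crawl frontier.
      parse.setOutlinks(new Outlink[] {});
      return parse;
    }

    @Override
    public Collection<WebPage.Field> getFields() {
      // This filter needs no extra storage fields.
      return Collections.emptySet();
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
  }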
When I specify '-depth 1', the URL is crawled only once, but if I specify
'-depth 3', it is crawled three times.
Is this the expected behavior? Should I use the 'crawl' command to do all the
work in one go?
On 2013-02-17 22:11:22, "高睿" <[email protected]> wrote:
>Hi,
>
>There is only one URL in the 'webpage' table. I ran the command:
>bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000
>and found that the URL was crawled twice.
>
>Here's the log:
>2013-02-17 20:45:00,965 INFO fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>2013-02-17 20:45:11,021 INFO parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>2013-02-17 20:45:38,922 INFO fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>2013-02-17 20:45:46,031 INFO parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>
>Do you know how to fix this?
>Also, when I run the command again, the same log entries are written to
>hadoop.log. I don't know why the 'db.fetch.interval.default' setting in
>nutch-site.xml doesn't take effect.
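>For reference, the property is set along these lines (a minimal sketch of
>the nutch-site.xml entry; the value shown is only illustrative, not my
>exact setting):
>
>  <property>
>    <name>db.fetch.interval.default</name>
>    <!-- 2592000 seconds = 30 days; illustrative value only -->
>    <value>2592000</value>
>    <description>Default number of seconds between re-fetches of the same
>    page.</description>
>  </property>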
>
>Thanks.
>
>Regards,
>Rui