Hi,
Additionally, the Nutch version is 2.1, and I have a ParseFilter that purges
the outlinks of the Parse object (via parse.setOutlinks(new Outlink[] {});).
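For context, the filter boils down to the sketch below. The class name is a
placeholder, the signatures follow the Nutch 2.x ParseFilter interface as I
understand it, and the plugin.xml wiring is omitted:

  import java.util.Collection;
  import java.util.Collections;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseFilter;
  import org.apache.nutch.storage.WebPage;
  import org.w3c.dom.DocumentFragment;

  // Placeholder name; the real plugin is registered through plugin.xml as usual.
  public class PurgeOutlinksFilter implements ParseFilter {

    private Configuration conf;

    @Override
    public Parse filter(String url, WebPage page, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      // Drop every outlink so nothing new should enter the crawl frontier.
      parse.setOutlinks(new Outlink[] {});
      return parse;
    }

    @Override
    public Collection<WebPage.Field> getFields() {
      // This filter needs no extra storage fields.
      return Collections.emptySet();
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
  }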
When I specify '-depth 1', the URL is crawled only once, but if I specify
'-depth 3', it is crawled three times.
Is this the expected behavior? Should I use the 'crawl' command to do all the
work in one go?
On 2013-02-17 22:11:22, "高睿" <[email protected]> wrote:
>Hi,
>
>There is only one URL in the 'webpage' table. I ran the command:
>bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000
>and found that the URL was crawled twice.
>
>Here's the log:
>2013-02-17 20:45:00,965 INFO fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>2013-02-17 20:45:11,021 INFO parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>2013-02-17 20:45:38,922 INFO fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>2013-02-17 20:45:46,031 INFO parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>
>Do you know how to fix this?
>Also, when I run the command again, the same log entries are written to
>hadoop.log. I don't know why the 'db.fetch.interval.default' setting in
>nutch-site.xml doesn't take effect.
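>For reference, the property is set along these lines (a minimal sketch of
>the nutch-site.xml entry; the value shown is only illustrative, not my
>exact setting):
>
>  <property>
>    <name>db.fetch.interval.default</name>
>    <!-- 2592000 seconds = 30 days; illustrative value only -->
>    <value>2592000</value>
>    <description>Default number of seconds between re-fetches of the same
>    page.</description>
>  </property>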
>
>Thanks.
>
>Regards,
>Rui