Hi,
Please make sure you have no temp files in the same directory and try again.
Please either use the crawl script that is provided with Nutch, or
alternatively build your own script.
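
If you do roll your own script, a minimal sketch for 2.x could look something
like the one below. Please treat the sub-command options as assumptions on my
part (run bin/nutch with no arguments and check each job's usage on your build
before relying on them):

  #!/bin/bash
  # Minimal Nutch 2.x crawl loop (sketch only; verify the exact options with bin/nutch).
  # Usage: ./my-crawl.sh <seed_dir> <solr_url> <depth>
  SEEDS="$1"
  SOLR="$2"
  DEPTH="$3"

  bin/nutch inject "$SEEDS"            # seed the webpage table once

  for ((i = 1; i <= DEPTH; i++)); do
    echo "=== round $i of $DEPTH ==="
    bin/nutch generate -topN 10000     # mark a batch of URLs as due for fetching
    bin/nutch fetch -all -threads 10   # fetch the generated batch
    bin/nutch parse -all               # parse what was fetched
    bin/nutch updatedb                 # write parse results and new links back
  done

  bin/nutch solrindex "$SOLR" -all     # index into Solr at the end

That is roughly what the bundled crawl script does for you in one go.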


On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
> Hi,
> Additionally, the Nutch version is 2.1, and I have a ParserFilter that purges
> the outlinks of the parse object (with the code: parse.setOutlinks(new Outlink[] {});).
>
> When I specify '-depth 1', the URL is only crawled once, and if I specify
> '-depth 3', the URL is crawled 3 times.
> Is this expected behavior? Should I use the 'crawl' command to do all the work
> in one go?
>
> At 2013-02-17 22:11:22,"高睿" <[email protected]> wrote:
>>Hi,
>>
>>There's only 1 URL in the 'webpage' table. I run the command: bin/nutch crawl
>>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000,
>>and then I find the URL is crawled twice.
>>
>>Here's the log:
>> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>
>>Do you know how to fix this?
>>Besides, when I run the command again, the same log is written to hadoop.log.
>>I don't know why the 'db.fetch.interval.default' configuration in nutch-site.xml
>>doesn't take effect.
>>
>>Thanks.
>>
>>Regards,
>>Rui
>

-- 
Lewis
