Hi,

What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'?

At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <[email protected]> wrote:
>Hi,
>Please make sure you have no temp files in the same directory and try again.
>Please either use the crawl script that is provided with Nutch, or
>alternatively build your own script.
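>
>For reference, a minimal invocation of the Nutch 2.x crawl script looks
>something like this (seed directory, crawl id, Solr URL, and number of
>rounds; the names here are just placeholders):
>
>  bin/crawl urls/ myCrawlId http://localhost:8080/solr/collection2 2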
>
>
>On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
>> Hi,
>> Additionally, the Nutch version is 2.1, and I have a ParserFilter that
>> purges the outlinks of the Parse object (via parse.setOutlinks(new Outlink[] {});).
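>>
>> A minimal sketch of what such a filter looks like, assuming the stock
>> Nutch 2.x ParseFilter interface (the class and package names here are
>> made up):
>>
>>   package com.example.nutch;
>>
>>   import java.util.Collection;
>>   import java.util.Collections;
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.nutch.parse.HTMLMetaTags;
>>   import org.apache.nutch.parse.Outlink;
>>   import org.apache.nutch.parse.Parse;
>>   import org.apache.nutch.parse.ParseFilter;
>>   import org.apache.nutch.storage.WebPage;
>>   import org.w3c.dom.DocumentFragment;
>>
>>   public class PurgeOutlinksFilter implements ParseFilter {
>>
>>     private Configuration conf;
>>
>>     @Override
>>     public Parse filter(String url, WebPage page, Parse parse,
>>         HTMLMetaTags metaTags, DocumentFragment doc) {
>>       // Drop every discovered outlink so the crawl never expands
>>       // beyond the injected seed URLs.
>>       parse.setOutlinks(new Outlink[] {});
>>       return parse;
>>     }
>>
>>     @Override
>>     public Collection<WebPage.Field> getFields() {
>>       return Collections.<WebPage.Field>emptySet(); // no extra fields read
>>     }
>>
>>     @Override
>>     public void setConf(Configuration conf) { this.conf = conf; }
>>
>>     @Override
>>     public Configuration getConf() { return conf; }
>>   }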
>>
>> When I specify '-depth 1', the URL is only crawled once, and if I specify
>> '-depth 3', the URL is crawled 3 times.
>> Is this expected behavior? Should I use the 'crawl' command to do all the
>> work in one go?
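>>
>> My understanding is that '-depth N' just runs N generate/fetch/parse/update
>> rounds, i.e. roughly the following (batch-id handling simplified with -all):
>>
>>   for round in 1 2 3; do
>>     bin/nutch generate -topN 10000   # pick URLs that are due for fetching
>>     bin/nutch fetch -all             # fetch the generated batch
>>     bin/nutch parse -all             # parse what was fetched
>>     bin/nutch updatedb               # write results back to 'webpage'
>>   done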
>>
>> At 2013-02-17 22:11:22,"高睿" <[email protected]> wrote:
>>>Hi,
>>>
>>>There's only 1 URL in the 'webpage' table. I run the command:
>>>bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000
>>>and then I find the URL is crawled twice.
>>>
>>>Here's the log:
>>>2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>
>>>Do you know how to fix this?
>>>Besides, when I run the command again, the same log is written to
>>>hadoop.log. I don't know why the 'db.fetch.interval.default' setting
>>>in nutch-site.xml doesn't take effect.
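>>>
>>>For reference, what I have in conf/nutch-site.xml is along these lines
>>>(the value shown here is just the stock 30-day default):
>>>
>>>  <property>
>>>    <name>db.fetch.interval.default</name>
>>>    <value>2592000</value>
>>>  </property>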
>>>
>>>Thanks.
>>>
>>>Regards,
>>>Rui
>>
>
>-- 
>*Lewis*
