Hi,

Maybe that URL has been generated three times. One reason could be that the
URL has reached its fetch time, so it gets generated again; check that your
fetchInterval is set correctly. Another reason could be that the fetcher or
marker steps didn't remove the marker from the database, so the current
marker is still GENERATE_MARK.
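If the interval is the problem, check the property in nutch-site.xml. A
sketch of what it should look like (the value is in seconds; 2592000, i.e.
30 days, is the default):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds to wait before re-fetching the same page -->
  <value>2592000</value>
</property>
```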

You can run the nutch commands step by step (generate -> fetch -> parse ->
updatedb) to see what happens.
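For example, roughly like this with Nutch 2.x (the exact flags may differ
for your setup, and <batchId> stands for the batch id printed by the
generate step):

```shell
# Generate a batch of URLs that are due for fetching
bin/nutch generate -topN 10000

# Fetch and parse that batch (or use -all instead of the batch id)
bin/nutch fetch <batchId>
bin/nutch parse <batchId>

# Update the db; after this step GENERATE_MARK should be cleared
bin/nutch updatedb
```

After each step you can inspect the row in the webpage table to see which
markers are still set on the URL.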

On 2/18/13, 高睿 <[email protected]> wrote:
> Hi,
>
> What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'?
>
> At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <[email protected]>
> wrote:
>>Hi,
>>Please make sure you have no temp files in the same directory and try again
>>Please either use the crawl script which is provided with nutch or
>>alternatively build your own script.
>>
>>
>>On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
>>> Hi,
>>> Additionally, the Nutch version is 2.1, and I have a ParserFilter that purges
>>the outlinks of the parse object (via parse.setOutlinks(new Outlink[] {});).
>>>
>>> When I specify '-depth 1', the URL is only crawled once, and if I specify
>>'-depth 3', the URL is crawled 3 times.
>>> Is this the expected behavior? Should I use the 'crawl' command to do all
>>the work in one go?
>>>
>>> At 2013-02-17 22:11:22,"高睿" <[email protected]> wrote:
>>>>Hi,
>>>>
>>>>There's only 1 URL in the table 'webpage'. I run the command: bin/nutch crawl
>>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN
>>10000, and then I find the URL is crawled twice.
>>>>
>>>>Here's the log:
>>>> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching
>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing
>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching
>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing
>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>
>>>>Do you know how to fix this?
>>>>Besides, when I run the command again, the same log is written to
>>hadoop.log. I don't know why the configuration 'db.fetch.interval.default'
>>in nutch-site.xml doesn't take effect.
>>>>
>>>>Thanks.
>>>>
>>>>Regards,
>>>>Rui
>>>
>>
>>--
>>*Lewis*
>


-- 
Don't Grow Old, Grow Up... :-)
