Hi,
I have the following configuration in nutch-site.xml:
<configuration>
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches
  of a page (30 days).
  </description>
</property>
</configuration>
So the fetch interval appears to be configured correctly, but I still don't
know why this setting doesn't take effect.
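To narrow this down I'll try running the crawl phases one at a time instead of the all-in-one crawl command. A sketch of what I have in mind, assuming a Nutch 2.x layout (the exact flags may differ between releases, so check `bin/nutch` usage first):

```shell
# Run each phase separately (Nutch 2.x) to see which phase
# re-generates the url that was already fetched.
bin/nutch inject urls
bin/nutch generate -topN 10000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

# After updatedb the GENERATE_MARK for the batch should be cleared.
# If a second 'generate' still picks up the url, then either the
# marker handling or the fetch interval is the problem.
```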
At 2013-02-18 10:16:47,"feng lu" <[email protected]> wrote:
>Hi,
>
>Maybe that url has been generated three times. One reason could be that
>the url has reached its fetch time, so it is generated again; check that
>your fetchInterval is set correctly. Another reason could be that the
>fetcher did not remove the marker from the database, so the current
>marker is still GENERATE_MARK.
>
>You can run the nutch commands step by step (generate -> fetch ->
>updatedb) to see what happens.
>
>On 2/18/13, 高睿 <[email protected]> wrote:
>> Hi,
>>
>> What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'?
>>
>> At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <[email protected]>
>> wrote:
>>>Hi,
>>>Please make sure you have no temp files in the same directory and try again
>>>Please either use the crawl script provided with Nutch or alternatively
>>>build your own script.
>>>
>>>
>>>On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
>>>> Hi,
>>>> Additionally, the nutch version is 2.1, and I have a ParseFilter that
>>>purges the outlinks of the parse object (via: parse.setOutlinks(new Outlink[] {});)
>>>>
>>>> When I specify '-depth 1', the url is only crawled once, and if I specify
>>>'-depth 3', the url is crawled 3 times.
>>>> Is this expected behavior? Should I use the 'crawl' command to do all the
>>>work in one go?
>>>>
>>>> At 2013-02-17 22:11:22,"高睿" <[email protected]> wrote:
>>>>>Hi,
>>>>>
>>>>>There's only 1 url in the 'webpage' table. I run the command: bin/nutch crawl
>>>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN
>>>10000, and then find that the url is crawled twice.
>>>>>
>>>>>Here's the log:
>>>>> 55 2013-02-17 20:45:00,965 INFO fetcher.FetcherJob - fetching
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>> 84 2013-02-17 20:45:11,021 INFO parse.ParserJob - Parsing
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>215 2013-02-17 20:45:38,922 INFO fetcher.FetcherJob - fetching
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>244 2013-02-17 20:45:46,031 INFO parse.ParserJob - Parsing
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>
>>>>>Do you know how to fix this?
>>>>>Besides, when I run the command again, the same log is written to
>>>hadoop.log. I don't know why the 'db.fetch.interval.default' setting
>>>in nutch-site.xml doesn't take effect.
>>>>>
>>>>>Thanks.
>>>>>
>>>>>Regards,
>>>>>Rui
>>>>
>>>
>>>--
>>>*Lewis*
>>
>
>
>--
>Don't Grow Old, Grow Up... :-)