The urls dir is not specified in the command.

bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000
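For example, with the seed directory added as the first argument (here assuming a local 'urls' directory containing the seed file; adjust the path and names to your setup):

```shell
# Hypothetical corrected invocation: 'urls' is a directory holding the seed
# list, e.g. urls/seed.txt with one URL per line.
bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000
```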

On 2013-02-18 09:53:33, "Lewis John Mcgibbney" <[email protected]> wrote:
>Wherever your url directory is kept
>
>On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
>> Hi,
>>
>> What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'?
>>
>>
>> At 2013-02-18 00:45:00, "Lewis John Mcgibbney" <[email protected]> wrote:
>>>Hi,
>>>Please make sure you have no temp files in the same directory and try again.
>>>Please either use the crawl script provided with Nutch or build your own script.
>>>
>>>
>>>On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
>>>> Hi,
>>>> Additionally, the Nutch version is 2.1, and I have a ParserFilter that purges the outlinks of the parse object (via: parse.setOutlinks(new Outlink[] {});).
>>>>
>>>> When I specify '-depth 1', the url is crawled only once, and if I specify '-depth 3', the url is crawled 3 times.
>>>> Is this expected behavior? Should I use the 'crawl' command to do all the work in one go?
>>>>
>>>>
>>>> At 2013-02-17 22:11:22,"高睿" <[email protected]> wrote:
>>>>>Hi,
>>>>>
>>>>>There's only 1 url in table 'webpage'. I run the command: bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000, and then I find the url is crawled twice.
>>>>>
>>>>>Here's the log:
>>>>> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>
>>>>>Do you know how to fix this?
>>>>>Besides, when I run the command again, the same log is written to hadoop.log. I don't know why the configuration 'db.fetch.interval.default' in nutch-site.xml doesn't take effect.
>>>>>
>>>>>Thanks.
>>>>>
>>>>>Regards,
>>>>>Rui
>>>>
>>>
>>>--
>>>*Lewis*
>>
>
>-- 
>*Lewis*
