> But I create a completely new crawl dir for every crawl.
Then all should work as expected.

> why does the crawler set a "page to fetch" to rejected? Obviously the
> crawler never saw this page before (because I deleted all the old crawl
> dirs).
> In the crawl log I see many pages to fetch, but in the end all of them
> are rejected
Are you sure they aren't fetched at all? This debug log output in the
Generator mapper is also shown for URLs fetched in previous cycles. You
should check the complete log for the "rejected" URLs.
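One way to check the complete log is to filter it for the Generator's message. A minimal sketch, assuming the Nutch 1.x default log location logs/hadoop.log; the sample log lines below are hypothetical stand-ins so the snippet is self-contained:

```shell
# Hypothetical sample log; against a real crawl, grep logs/hadoop.log instead.
log=$(mktemp)
printf '%s\n' \
  "-shouldFetch rejected 'http://www.page.com/shop', fetchTime=1356347311285, curTime=1353755337755" \
  "fetching http://www.page.com/" > "$log"

# Count the URLs the Generator rejected as not yet due for fetching.
grep -c "shouldFetch rejected" "$log"
```

Comparing the rejected URLs against the fetcher's output shows whether they were really skipped or fetched in an earlier cycle.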


On 11/24/2012 04:46 PM, Jan Philippe Wimmer wrote:
> Hey Sebastian! Thanks for your answer.
> 
> But I create a completely new crawl dir for every crawl. In other words, I
> only have the crawl data of the current, running crawl process. When I
> recrawl a URL set, I delete the old crawl dir and create a new one. At the
> end of each crawl I index it to Solr, so I keep all crawled content in the
> index. I don't need any Nutch crawl dirs, because I want to crawl all
> relevant pages in every crawl process, again and again.
> 
> I totally don't understand why the crawler sets a "page to fetch" to
> rejected, because obviously the crawler never saw this page before (I
> deleted all the old crawl dirs). In the crawl log I see many pages to
> fetch, but in the end all of them are rejected. Any ideas?
> 
> On 11/24/2012 04:36 PM, Sebastian Nagel wrote:
>>> I want my crawler to crawl the complete page without setting up
>>> schedulers at all. Every crawl process should crawl every page again
>>> without having to set up wait intervals.
>> That's quite easy: remove all data and launch the crawl again.
>> - Nutch 1.x: remove crawldb, segments, and linkdb
>> - Nutch 2.x: drop 'webpage' (or similar, depending on the chosen data store)
>>
>> On 11/24/2012 12:17 PM, Jan Philippe Wimmer wrote:
>>> Hi there,
>>>
>>> how can I avoid the following error:
>>> -shouldFetch rejected 'http://www.page.com/shop', fetchTime=1356347311285, 
>>> curTime=1353755337755
>>>
>>> I want my crawler to crawl the complete page without setting up
>>> schedulers at all. Every crawl process should crawl every page again
>>> without having to set up wait intervals.
>>>
>>> Any soluti
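The data removal Sebastian describes for Nutch 1.x amounts to deleting three directories. A sketch, assuming the crawl directory is simply named "crawl" (adjust the path to your own layout):

```shell
# Create a throwaway layout to demonstrate; in practice, "crawl" is the
# directory passed to the Nutch crawl commands.
crawl=$(mktemp -d)/crawl
mkdir -p "$crawl/crawldb" "$crawl/segments" "$crawl/linkdb"

# Nutch 1.x: remove crawldb, segments, and linkdb so no fetch history
# survives and every page is fetched again on the next crawl.
rm -rf "$crawl/crawldb" "$crawl/segments" "$crawl/linkdb"

ls -A "$crawl" | wc -l   # 0 entries remain
```

With all three directories gone, the next crawl has no record of earlier fetch times, so shouldFetch cannot reject a URL as "not yet due".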
