> I forgot to mention that I am using Nutch 2.2.1 and I can't find
> http.redirect.max. I guess that it is only in 1.x.

Yes.

> Any ideas on how to answer my 1st question? (I do not want the same page to
> be refetched).

In 2.x, redirects are only recorded, never followed immediately. If a page has
already been fetched, it will not be fetched again right away (only after some
"longer" time).

On 02/17/2014 10:28 AM, Vangelis karv wrote:
> Thank you Sebastian for your trouble!
>
> I forgot to mention that I am using Nutch 2.2.1 and I can't find
> http.redirect.max. I guess that it is only in 1.x.
> Any ideas on how to answer my 1st question? (I do not want the same page to
> be refetched).
>
>> Date: Sun, 16 Feb 2014 14:52:20 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: Threads
>>
>> Hi Vangelis,
>>
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
>>> the next cycle?
>> Yes, if http.redirect.max == 0 (which is the default).
>>
>>> 2) I understand that the whole set of urls to be fetched is saved at
>>> QueueFeeder.
>> QueueFeeder reads the generated list of urls chunk by chunk and (re-)feeds
>> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the
>> list of urls is long, it is never stored entirely in memory.
>>
>>> Each thread will be assigned a number of urls to fetch equal to:
>>> (wholeSetToBeFetched) / (numberOfThreads) ?
>> After having fetched a url, a FetcherThread asks for a new URL. If it does
>> not get one because all queues are blocked for politeness, it sleeps a
>> second and tries again. The exact number of urls processed by a thread is
>> random, but ideally the number should be approx. equal for each thread. Of
>> course, there should not be many more threads than queues (hosts, domains,
>> ips), at least if fetcher.threads.per.queue == 1.
>>
>> Sebastian
>>
>>
>> On 02/14/2014 01:20 PM, Vangelis karv wrote:
>>> Thank you Marcus for your fast response!
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
>>> the next cycle?
>>> 2) I understand that the whole set of urls to be fetched is saved at
>>> QueueFeeder. Each thread will be assigned a number of urls to fetch equal
>>> to: (wholeSetToBeFetched) / (numberOfThreads) ?
>>>
>>> Happy Valentine's Day!
>>>
>>>> Subject: RE: Threads
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Date: Fri, 14 Feb 2014 11:45:16 +0000
>>>>
>>>> Hi,
>>>>
>>>> They take records (FetchItems) from the QueueFeeder. Queues are based
>>>> on domain, host or ip and a URL exists only once, so nothing collides. The
>>>> redirect will be followed in the next fetch cycle.
>>>>
>>>> Markus
>>>>
>>>> -----Original message-----
>>>>> From:Vangelis karv <[email protected]>
>>>>> Sent: Friday 14th February 2014 12:39
>>>>> To: [email protected]
>>>>> Subject: Threads
>>>>>
>>>>> Hello people!
>>>>>
>>>>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean,
>>>>> who tells them what pages each one of them will fetch?
>>>>> Is it possible for some of them to collide or fetch the same page without
>>>>> them knowing?
>>>>> I read the code and found that if the redirect is to the same page, it
>>>>> will not follow that redirect. Any advice would be very helpful!
>>>>>
>>>>> Vangelis
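
For readers of this thread, the model described above (per-host queues, each
URL queued only once, threads asking for the next URL and sleeping when every
queue is blocked for politeness) can be sketched roughly as below. This is not
Nutch code: the class FetchQueuesSketch, its methods, and the 10 ms sleep are
hypothetical names and values chosen for illustration, and the politeness delay
is simplified to start when a URL is handed out rather than when the fetch ends.

```java
import java.net.URI;
import java.util.*;
import java.util.concurrent.*;

// Hypothetical illustration only; not the Nutch Fetcher implementation.
class FetchQueuesSketch {
    // One FIFO queue per host, loosely analogous to a map <host, queue>.
    private final Map<String, Deque<String>> queues = new HashMap<>();
    // Earliest time (ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();
    // URLs already enqueued, so that a URL exists only once.
    private final Set<String> seen = new HashSet<>();
    private final long delayMs;
    private int pending = 0;

    FetchQueuesSketch(long delayMs) { this.delayMs = delayMs; }

    synchronized void add(String url) {
        if (!seen.add(url)) return; // duplicate URL: ignore it
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        pending++;
    }

    // Returns a URL from any host that is not currently blocked, or null.
    synchronized String poll() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            if (e.getValue().isEmpty()) continue;
            if (nextAllowed.getOrDefault(e.getKey(), 0L) > now) continue;
            nextAllowed.put(e.getKey(), now + delayMs); // block host for a while
            pending--;
            return e.getValue().poll();
        }
        return null;
    }

    synchronized boolean isDrained() { return pending == 0; }

    // Runs `threads` workers until every queued URL has been handed out.
    static List<String> crawl(List<String> urls, int threads, long delayMs) {
        FetchQueuesSketch q = new FetchQueuesSketch(delayMs);
        urls.forEach(q::add);
        List<String> fetched = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (!q.isDrained()) {
                    String url = q.poll();
                    if (url == null) { // all queues blocked: wait and retry
                        try { Thread.sleep(10); }
                        catch (InterruptedException ex) { return; }
                        continue;
                    }
                    fetched.add(url); // stand-in for the actual fetch
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://b.example/1", "http://a.example/1"); // last one is a duplicate
        System.out.println("fetched " + crawl(urls, 4, 20).size() + " URLs");
    }
}
```

No thread is assigned a fixed share of the list up front; each one simply takes
whatever is available next, which is why the per-thread counts end up only
approximately equal, and why threads beyond the number of hosts mostly sleep
when only one thread per queue is allowed.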

