Thank you, Sebastian, for your trouble! I forgot to mention that I am using Nutch 2.2.1 and I can't find http.redirect.max. I guess that it exists only in 1.x. Any ideas on how to answer my 1st question? (I do not want the same page to be refetched.)

> Date: Sun, 16 Feb 2014 14:52:20 +0100
> From: [email protected]
> To: [email protected]
> Subject: Re: Threads
>
> Hi Vangelis,
>
> > 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
> > the next cycle?
> Yes, if http.redirect.max == 0 (which is the default).
>
> > 2) I understand that the whole set of urls to be fetched is saved at
> > QueueFeeder.
> QueueFeeder reads the generated list of urls chunk by chunk and (re-)feeds
> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the
> list of urls is long, it is never stored entirely in memory.
>
> > Each thread will be assigned a number of urls to fetch equal to:
> > (wholeSetToBeFetched) / (numberOfThreads) ?
> After having fetched a url, a FetcherThread asks for a new URL. If it does
> not get one because all queues are blocked for politeness, it sleeps a
> second and tries again. The exact number of urls processed by a thread is
> random, but ideally the number should be approx. equal for each thread. Of
> course, there should not be many more threads than queues (hosts, domains,
> ips), at least if fetcher.threads.per.queue == 1.
>
> Sebastian
>
> On 02/14/2014 01:20 PM, Vangelis karv wrote:
> > Thank you Markus for your fast response!
> > 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
> > the next cycle?
> > 2) I understand that the whole set of urls to be fetched is saved at
> > QueueFeeder. Each thread will be assigned a number of urls to fetch equal
> > to: (wholeSetToBeFetched) / (numberOfThreads) ?
> >
> > Happy Valentine's Day!
> >
> >> Subject: RE: Threads
> >> From: [email protected]
> >> To: [email protected]
> >> Date: Fri, 14 Feb 2014 11:45:16 +0000
> >>
> >> Hi,
> >>
> >> They take records (FetchItems) from the QueueFeeder. Queues are based
> >> on domain, host or ip and a URL exists only once, so nothing collides. The
> >> redirect will be followed in the next fetch cycle.
> >>
> >> Markus
> >>
> >> -----Original message-----
> >>> From: Vangelis karv <[email protected]>
> >>> Sent: Friday 14th February 2014 12:39
> >>> To: [email protected]
> >>> Subject: Threads
> >>>
> >>> Hello people!
> >>>
> >>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean,
> >>> who tells them what pages each one of them will fetch?
> >>> Is it possible for some of them to collide or fetch the same page without
> >>> them knowing?
> >>> I read the code and found that if the redirect is to the same page, it
> >>> will not follow that redirect. Any advice would be very helpful!
> >>>
> >>> Vangelis
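
For reference, the queueing behaviour described in the quoted replies can be modelled roughly as below. This is a simplified Python sketch, not Nutch's Java code: the names FetchQueues, next_url, worker and run, the crawl_delay value and the example URLs are all made up for illustration, and politeness is reduced to a fixed per-host delay counted from the moment a URL is handed out.

```python
import threading
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class FetchQueues:
    """Toy model of per-host fetch queues (hypothetical, not Nutch's API)."""

    def __init__(self, urls, crawl_delay=0.05):
        self.lock = threading.Lock()
        self.crawl_delay = crawl_delay
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.next_allowed = {}            # host -> earliest time of next fetch
        # dict.fromkeys drops duplicates, so each URL is queued only once.
        for url in dict.fromkeys(urls):
            self.queues[urlparse(url).netloc].append(url)

    def pending(self):
        with self.lock:
            return sum(len(q) for q in self.queues.values())

    def next_url(self):
        """Return a URL from a host that is not blocked, or None."""
        now = time.monotonic()
        with self.lock:
            for host, queue in self.queues.items():
                if queue and self.next_allowed.get(host, float("-inf")) <= now:
                    # Block this host for the politeness delay.
                    self.next_allowed[host] = now + self.crawl_delay
                    return queue.popleft()
        return None


def worker(queues, fetched, name):
    # Each thread keeps asking for the next URL; nothing is pre-assigned.
    while queues.pending() > 0:
        url = queues.next_url()
        if url is None:
            time.sleep(0.01)  # all hosts blocked for politeness: retry later
            continue
        fetched.append((name, url))  # stand-in for the real HTTP fetch


def run(urls, num_threads=4):
    queues = FetchQueues(urls)
    fetched = []
    threads = [
        threading.Thread(target=worker, args=(queues, fetched, f"t{i}"))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched
```

In this model no thread owns a fixed share of the URLs. Each one pulls the next available item under a lock, which is why a URL cannot be handed to two threads, and why threads beyond the number of hosts mostly sleep.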

