Re: Threads

Sebastian Nagel Mon, 17 Feb 2014 04:25:26 -0800

> I forgot to mention that I am using Nutch 2.2.1 and I can't find
> http.redirect.max. I guess that it is only in 1.x.
Yes.
> Any ideas on how to answer my 1st question? (I do not want the same page to 
> be refetched).
For 2.x, redirects are only recorded, never followed immediately.
If a page has already been fetched, it will not be re-fetched right away (only
after some "longer" time).
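
A rough illustration of the queue mechanics described in the quoted message
below: URLs are routed into one queue per host, a fetcher thread asks for the
next URL, and it sleeps and retries when every queue is blocked for politeness.
This is only a toy model under stated assumptions, not Nutch's actual code. The
names FetchQueueSketch, HostQueue, HostQueues, fetchAll and the crawlDelayMs
parameter are made up for the example, and queues are keyed by host only.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model of per-host fetch queues; all names here are hypothetical. */
public class FetchQueueSketch {

    /** URLs of one host plus the earliest time that host may be hit again. */
    static final class HostQueue {
        final Queue<String> urls = new ArrayDeque<>();
        long nextFetchTime = 0L;
    }

    /** Stand-in for the map <host, queue> that the fetcher threads share. */
    static final class HostQueues {
        private final Map<String, HostQueue> queues = new HashMap<>();
        private final long crawlDelayMs;

        HostQueues(long crawlDelayMs) { this.crawlDelayMs = crawlDelayMs; }

        /** Feeder side: route a URL to the queue of its host. */
        synchronized void add(String url) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
        }

        /** Fetcher side: next URL whose host is not blocked, or null. */
        synchronized String poll(long now) {
            for (HostQueue q : queues.values()) {
                if (!q.urls.isEmpty() && q.nextFetchTime <= now) {
                    q.nextFetchTime = now + crawlDelayMs; // block host for a while
                    return q.urls.poll();
                }
            }
            return null; // every queue is empty or blocked for politeness
        }

        synchronized boolean isEmpty() {
            return queues.values().stream().allMatch(q -> q.urls.isEmpty());
        }
    }

    /** Drains the queues with the given number of threads; returns per-thread counts. */
    static int[] fetchAll(HostQueues queues, int threads) {
        int[] counts = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                while (!queues.isEmpty()) {
                    String url = queues.poll(System.currentTimeMillis());
                    if (url == null) { // nothing available right now: wait, retry
                        try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                        continue;
                    }
                    counts[id]++; // a real fetcher would download the page here
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        HostQueues queues = new HostQueues(50);
        for (String u : List.of("http://a.example/1", "http://a.example/2",
                                "http://b.example/1", "http://b.example/2")) {
            queues.add(u);
        }
        int total = 0;
        for (int c : fetchAll(queues, 3)) total += c;
        System.out.println("fetched " + total + " URLs");
    }
}
```

Because poll() is synchronized, each queued URL is handed to exactly one
thread. How many URLs a given thread ends up with depends on timing, and with
two hosts and three threads at least one thread is usually idle, which is why
having many more threads than queues does not help.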


On 02/17/2014 10:28 AM, Vangelis karv wrote:
> Thank you Sebastian for your trouble!
> 
> I forgot to mention that I am using Nutch 2.2.1 and I can't find 
> http.redirect.max. I guess that it is only in 1.x.
> Any ideas on how to answer my 1st question? (I do not want the same page to 
> be refetched).
> 
>> Date: Sun, 16 Feb 2014 14:52:20 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: Threads
>>
>> Hi Vangelis,
>>
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at 
>>> the next cycle?
>> Yes, if http.redirect.max == 0 (which is the default).
>>
>>> 2) I understand that the whole set of urls to be fetched is saved at 
>>> QueueFeeder.
>> QueueFeeder reads the generated list of URLs chunk by chunk and (re-)feeds
>> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the list
>> of URLs is long, it is never stored entirely in memory.
>>
>>> Each thread will be assigned a number of urls to fetch equal to: 
>>> (wholeSetToBeFetched) /
>>> (numberOfThreads) ?
>> After having fetched a URL, a FetcherThread asks for a new URL. If it does
>> not get one because all queues are blocked for politeness, it sleeps a second
>> and tries again. The exact number of URLs processed by a thread is random, but
>> ideally the number should be approximately equal for each thread. Of course,
>> there should not be many more threads than queues (hosts, domains, IPs), at
>> least if fetcher.threads.per.queue == 1.
>>
>> Sebastian
>>
>>
>> On 02/14/2014 01:20 PM, Vangelis karv wrote:
>>> Thank you Marcus for your fast response! 
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at 
>>> the next cycle?
>>> 2) I understand that the whole set of urls to be fetched is saved at 
>>> QueueFeeder. Each thread will be assigned a number of urls to fetch equal 
>>> to: (wholeSetToBeFetched) / (numberOfThreads) ?
>>>
>>> Happy Valentine's Day!
>>>
>>>> Subject: RE: Threads
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Date: Fri, 14 Feb 2014 11:45:16 +0000
>>>>
>>>> Hi,
>>>>
>>>> They take records (FetchItems) from the QueueFeeder. Queues are based 
>>>> on domain, host or ip and a URL exists only once, so nothing collides. The 
>>>> redirect will be followed in the next fetch cycle.
>>>>
>>>> Markus
>>>>
>>>>  
>>>>  
>>>> -----Original message-----
>>>>> From:Vangelis karv <[email protected]>
>>>>> Sent: Friday 14th February 2014 12:39
>>>>> To: [email protected]
>>>>> Subject: Threads
>>>>>
>>>>> Hello people!
>>>>>
>>>>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean, 
>>>>> who tells them what pages each one of them will fetch?
>>>>> Is it possible for some of them to collide or fetch the same page without 
>>>>> them knowing? 
>>>>> I read the code and found that if the redirect is to the same page, it 
>>>>> will not follow that redirect. Any advice would be very helpful! 
>>>>>
>>>>> Vangelis
