Re: Threads

Sebastian Nagel Mon, 17 Feb 2014 04:27:32 -0800

Hi Vangelis,

Please open a new thread (in the mailing-list sense) for a new topic.


Thanks,
Sebastian

On 02/17/2014 12:12 PM, Vangelis karv wrote:
> My exact problem is the following: I want to write a scoring function that
> increases a URL's score by 10 whenever the URL contains a .jpg image. In the
> distributeScoreToOutlinks method I added this:
> 
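> for (ScoreDatum free : scoreData) {
>   try {
>     String aleos = free.getUrl();
>     // add 10 for every entry in scoreData whose URL contains ".jpg"
>     if (aleos.contains(".jpg")) {
>       adjust += 10.0f;
>     }
>   } catch (Exception e) {
>     // exceptions are silently ignored here
>   }
> }
>
> // add the accumulated bonus on top of the score already stored in the row
> float aleks = row.getScore();
> row.setScore(aleks + adjust);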
> 
> For example, http://www.uefa.com/ contains ~25 .jpg images and gets a score of
> ~251 with my scoring plugin. At depth 2 that score goes to 502, at depth 3 to
> 1004, etc.
> I want that page's score to stay at 251, and the page not to be refetched and
> updated again. I think my problem is that Nutch, at the beginning of each
> cycle, updates http://www.uefa.com/ again, which is my prime URL.
> 
> Any ideas?
> Thank you in advance!
> 
> From: [email protected]
> To: [email protected]
> Subject: RE: Threads
> Date: Mon, 17 Feb 2014 11:28:43 +0200
> 
> 
> 
> 
> Thank you Sebastian for your trouble!
> 
> I forgot to mention that I am using Nutch 2.2.1 and I can't find
> http.redirect.max. I guess that it is only in 1.x.
> Any ideas on how to answer my 1st question? (I do not want the same page to 
> be refetched).
> 
>> Date: Sun, 16 Feb 2014 14:52:20 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: Threads
>>
>> Hi Vangelis,
>>
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at 
>>> the next cycle?
>> Yes, if http.redirect.max == 0 (which is the default).
>>
>>> 2) I understand that the whole set of urls to be fetched is saved at 
>>> QueueFeeder.
>> QueueFeeder reads the generated list of urls chunk by chunk and (re-)feeds
>> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the total
>> number of urls is large, the list is never stored entirely in memory.
>>
>>> Each thread will be assigned a number of urls to fetch equal to:
>>> (wholeSetToBeFetched) / (numberOfThreads) ?
>> After having fetched a url, a FetcherThread asks for a new URL. If it does not
>> get one because all queues are blocked for politeness, it sleeps a second and
>> tries again. The exact number of urls processed by a thread is random, but
>> ideally the number should be approx. equal for each thread. Of course, there
>> should not be many more threads than queues (hosts, domains, ips), at least
>> if fetcher.threads.per.queue == 1.
>>
>> Sebastian
>>
>>
>> On 02/14/2014 01:20 PM, Vangelis karv wrote:
>>> Thank you Marcus for your fast response! 
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at 
>>> the next cycle?
>>> 2) I understand that the whole set of urls to be fetched is saved at 
>>> QueueFeeder. Each thread will be assigned a number of urls to fetch equal 
>>> to: (wholeSetToBeFetched) / (numberOfThreads) ?
>>>
>>> Happy Valentine's Day!
>>>
>>>> Subject: RE: Threads
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Date: Fri, 14 Feb 2014 11:45:16 +0000
>>>>
>>>> Hi,
>>>>
>>>> They take records (FetchItems) from the QueueFeeder. Queues are based
>>>> on domain, host or ip and a URL exists only once, so nothing collides. The
>>>> redirect will be followed in the next fetch cycle.
>>>>
>>>> Markus
>>>>
>>>>  
>>>>  
>>>> -----Original message-----
>>>>> From:Vangelis karv <[email protected]>
>>>>> Sent: Friday 14th February 2014 12:39
>>>>> To: [email protected]
>>>>> Subject: Threads
>>>>>
>>>>> Hello people!
>>>>>
>>>>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean,
>>>>> who tells each of them which pages to fetch?
>>>>> Is it possible for some of them to collide or fetch the same page
>>>>> without knowing?
>>>>> I read the code and found that if the redirect is to the same page, it
>>>>> will not follow that redirect. Any advice would be very helpful!
>>>>>
>>>>> Vangelis
>>>>
>>>                                       
>>>
>>
>                                                                               
>   
>
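
On the scoring question quoted above (the .jpg bonus growing from ~251 to 502
to 1004): the quoted snippet adds the bonus on top of whatever score is already
stored, so every pass over the same page adds to it again. One possible way
around that, shown here only as a minimal sketch, is to remember that the
bonus was already applied and skip it afterwards. Page, BONUS_MARKER and
applyJpgBonusOnce are hypothetical names made up for this example; they are
not Nutch classes or methods, and how to store such a marker in a real Nutch
2.x row is left open.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class JpgBonusSketch {

    /** Hypothetical stand-in for a stored page row; NOT the Nutch WebPage class. */
    public static final class Page {
        public float score;
        public final Map<String, String> metadata = new HashMap<>();

        public Page(float score) {
            this.score = score;
        }
    }

    // Hypothetical metadata key used to remember that the bonus was applied.
    public static final String BONUS_MARKER = "jpgBonusApplied";
    public static final float BONUS_PER_IMAGE = 10.0f;

    /** Adds 10 per ".jpg" URL, but only the first time it is called for a page. */
    public static void applyJpgBonusOnce(Page page, List<String> urls) {
        if (page.metadata.containsKey(BONUS_MARKER)) {
            return; // bonus already applied in an earlier cycle, leave the score alone
        }
        float adjust = 0.0f;
        for (String url : urls) {
            if (url != null && url.toLowerCase(Locale.ROOT).contains(".jpg")) {
                adjust += BONUS_PER_IMAGE;
            }
        }
        page.score += adjust;
        page.metadata.put(BONUS_MARKER, "true");
    }

    public static void main(String[] args) {
        Page page = new Page(1.0f);
        List<String> urls = Arrays.asList(
                "http://example.com/a.jpg",
                "http://example.com/b.JPG",
                "http://example.com/c.html");

        applyJpgBonusOnce(page, urls); // first cycle: 1.0 + 2 * 10.0
        applyJpgBonusOnce(page, urls); // second cycle: no change
        System.out.println(page.score); // prints 21.0
    }
}
```

This only addresses the accumulating score; whether and when a page is
refetched is a separate matter that the sketch does not touch.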

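On the older question in this thread (how fetcher threads share the work): as
described in the quoted replies, URLs are not divided among threads up front;
each thread asks for the next URL and gets one from any queue that is not
blocked for politeness. The toy model below illustrates just that idea. It is
a simplified sketch, not Nutch code: HostQueuesSketch, add and next are
hypothetical names, time is passed in explicitly to keep the example
deterministic, and a queue is blocked from the moment a URL is handed out,
whereas the real fetcher's bookkeeping is more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

public class HostQueuesSketch {

    /** One queue per host, with the earliest time the next URL may be handed out. */
    private static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTimeMs = 0L;
    }

    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs;

    public HostQueuesSketch(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Adds a URL to the queue of its host (the host is given explicitly here). */
    public synchronized void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    /**
     * Called by a worker that wants something to fetch. Returns a URL from the
     * first queue that is non-empty and not blocked at nowMs, or null if every
     * queue is empty or blocked (the caller would then sleep and retry).
     */
    public synchronized String next(long nowMs) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && q.nextFetchTimeMs <= nowMs) {
                q.nextFetchTimeMs = nowMs + crawlDelayMs; // block this host for a while
                return q.urls.poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        HostQueuesSketch queues = new HostQueuesSketch(1000L);
        queues.add("a.example", "http://a.example/1");
        queues.add("a.example", "http://a.example/2");
        queues.add("b.example", "http://b.example/1");

        System.out.println(queues.next(0L));    // http://a.example/1
        System.out.println(queues.next(0L));    // http://b.example/1
        System.out.println(queues.next(0L));    // null: a.example is blocked, b.example is empty
        System.out.println(queues.next(1000L)); // http://a.example/2
    }
}
```

With only two hosts, a third worker asking at time 0 gets nothing, which is
the reason given in the thread for not running many more threads than queues.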