Hi Vangelis,
please open a new thread (that is, a separate mail) for a new topic.
Thanks,
Sebastian
On 02/17/2014 12:12 PM, Vangelis karv wrote:
> My exact problem is the following: I want to write a scoring function that
> increases a URL's score by 10 for every .jpg image the page contains. In the
> method distributeScoreToOutlinks I added the following:
>
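> // NOTE: adjust is assumed to be a float declared earlier in the method.
> for (ScoreDatum datum : scoreData) {
>   String outlinkUrl = datum.getUrl();
>   // Add 10 for every outlink whose URL contains ".jpg".
>   if (outlinkUrl != null && outlinkUrl.contains(".jpg")) {
>     adjust += 10.0f;
>   }
> }
>
> // Add the accumulated bonus to the page's current score.
> float currentScore = row.getScore();
> row.setScore(currentScore + adjust);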
>
> For example, http://www.uefa.com/ contains ~25 .jpg images and gets a score of
> ~251 with my scoring plugin. At depth 2 that score goes to 502, at depth 3 to
> 1004, etc.
> I want that page's score to stay at 251, and the page not to be refetched and
> updated again. I think my problem is that, at the beginning of each crawl
> cycle, Nutch updates http://www.uefa.com/ again, which is my seed URL.
>
> Any ideas?
> Thank you in advance!
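
A rough sketch of one way to keep such a bonus from being added again in every
cycle: remember which pages have already received it. This is plain Java, not
Nutch code. The class JpgBonusSketch, its methods, and the in-memory set are all
hypothetical; in a real scoring plugin the "already scored" marker would have to
be stored with the page itself, and how to do that is left open here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: add a bonus of 10 per outlink URL containing ".jpg", but only once per page. */
class JpgBonusSketch {
    static final float BONUS_PER_JPG = 10.0f;

    // Hypothetical stand-in for a persisted per-page marker; not a Nutch API.
    private final Set<String> alreadyScored = new HashSet<>();

    /** Sums the bonus over all outlink URLs that contain ".jpg". */
    static float jpgBonus(List<String> outlinkUrls) {
        float bonus = 0.0f;
        for (String url : outlinkUrls) {
            if (url != null && url.contains(".jpg")) {
                bonus += BONUS_PER_JPG;
            }
        }
        return bonus;
    }

    /** Returns the new score; the bonus is added only the first time a page is seen. */
    float score(String pageUrl, float currentScore, List<String> outlinkUrls) {
        if (!alreadyScored.add(pageUrl)) {
            return currentScore; // bonus was already applied in an earlier cycle
        }
        return currentScore + jpgBonus(outlinkUrls);
    }

    public static void main(String[] args) {
        JpgBonusSketch sketch = new JpgBonusSketch();
        List<String> outlinks =
            List.of("http://example.com/a.jpg", "http://example.com/b.html");
        float first = sketch.score("http://example.com/", 1.0f, outlinks);
        float second = sketch.score("http://example.com/", first, outlinks);
        System.out.println(first + " " + second);
    }
}
```

The second call returns the score unchanged, so repeated update cycles no longer
accumulate the bonus. Whether this also removes the doubling described above
depends on what else modifies the score between cycles.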
>
> From: [email protected]
> To: [email protected]
> Subject: RE: Threads
> Date: Mon, 17 Feb 2014 11:28:43 +0200
>
>
>
>
> Thank you Sebastian for your trouble!
>
> I forgot to mention that I am using Nutch 2.2.1, and I can't find
> http.redirect.max there. I guess it exists only in 1.x.
> Any ideas on how to answer my first question? (I do not want the same page to
> be refetched.)
>
>> Date: Sun, 16 Feb 2014 14:52:20 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: Threads
>>
>> Hi Vangelis,
>>
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
>>> the next cycle?
>> Yes, if http.redirect.max == 0 (which is the default).
>>
>>> 2) I understand that the whole set of urls to be fetched is saved at
>>> QueueFeeder.
>> QueueFeeder reads the generated list of URLs chunk by chunk and (re-)feeds
>> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the
>> list of URLs is long, it is never stored entirely in memory.
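
To illustrate the chunk-wise feeding described above, here is a minimal sketch
in plain Java. It is not the Nutch implementation: ChunkedFeederSketch, feed(),
poll() and the capacity parameter are made-up names, and queues are keyed by
host only.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Sketch of chunk-wise feeding into per-host queues; names are illustrative only. */
class ChunkedFeederSketch {
    // One FIFO queue per host, similar in spirit to a map <host, queue>.
    final Map<String, Queue<String>> queuesByHost = new HashMap<>();
    private final Iterator<String> generated; // the generated URL list, read lazily
    private final int capacity;               // assumed limit on URLs held in memory
    private int queued = 0;

    ChunkedFeederSketch(Iterator<String> generated, int capacity) {
        this.generated = generated;
        this.capacity = capacity;
    }

    /** Moves URLs from the generated list into the queues until capacity is reached. */
    int feed() {
        int fed = 0;
        while (queued < capacity && generated.hasNext()) {
            String url = generated.next();
            String host = URI.create(url).getHost();
            queuesByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
            queued++;
            fed++;
        }
        return fed;
    }

    /** Takes one URL from the given host's queue, or returns null if it is empty. */
    String poll(String host) {
        Queue<String> queue = queuesByHost.get(host);
        String url = (queue == null) ? null : queue.poll();
        if (url != null) {
            queued--; // frees one slot, so the next feed() can refill
        }
        return url;
    }

    public static void main(String[] args) {
        List<String> urls =
            List.of("http://a.example/1", "http://a.example/2", "http://b.example/1");
        ChunkedFeederSketch feeder = new ChunkedFeederSketch(urls.iterator(), 2);
        System.out.println("fed " + feeder.feed());
        System.out.println("polled " + feeder.poll("a.example"));
        System.out.println("fed " + feeder.feed());
    }
}
```

Only `capacity` URLs are held at any time; the rest stay in the underlying
iterator until a slot becomes free.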
>>
>>> Each thread will be assigned a number of urls to fetch equal to:
>>> (wholeSetToBeFetched) /
>>> (numberOfThreads) ?
>> After having fetched a URL, a FetcherThread asks for a new URL. If it does
>> not get one because all queues are blocked for politeness, it sleeps a second
>> and tries again. The exact number of URLs processed by a thread is random,
>> but ideally the number should be approximately equal for each thread. Of
>> course, there should not be many more threads than queues (hosts, domains,
>> IPs), at least if fetcher.threads.per.queue == 1.
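
The "ask for a URL, sleep if everything is blocked" loop can be sketched as
follows. Again this is plain Java with invented names (PoliteQueuesSketch,
next(), crawlDelayMs), not the Nutch fetcher. It is simplified in that a host is
blocked from the moment a URL is handed out, and the sleep is 50 ms rather than
a second to keep the demo short.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Sketch of per-host queues with a politeness delay; names are illustrative only. */
class PoliteQueuesSketch {
    private static class HostQueue {
        final Queue<String> urls = new ArrayDeque<>();
        long nextFetchTime = 0L; // earliest time (ms) this host may be contacted again
    }

    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long crawlDelayMs;

    PoliteQueuesSketch(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new HostQueue()).urls.add(url);
    }

    /** Returns a URL from the first host not blocked at nowMs, or null if all are blocked. */
    synchronized String next(long nowMs) {
        for (HostQueue queue : queues.values()) {
            if (!queue.urls.isEmpty() && nowMs >= queue.nextFetchTime) {
                queue.nextFetchTime = nowMs + crawlDelayMs; // block this host for the delay
                return queue.urls.poll();
            }
        }
        return null;
    }

    synchronized boolean isEmpty() {
        for (HostQueue queue : queues.values()) {
            if (!queue.urls.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteQueuesSketch queues = new PoliteQueuesSketch(200);
        queues.add("a.example", "http://a.example/1");
        queues.add("a.example", "http://a.example/2");
        queues.add("b.example", "http://b.example/1");
        Runnable worker = () -> {
            try {
                while (!queues.isEmpty()) {
                    String url = queues.next(System.currentTimeMillis());
                    if (url == null) {
                        Thread.sleep(50); // all queues blocked: wait, then try again
                    } else {
                        System.out.println(Thread.currentThread().getName() + " fetches " + url);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread first = new Thread(worker, "fetcher-1");
        Thread second = new Thread(worker, "fetcher-2");
        first.start();
        second.start();
        first.join();
        second.join();
    }
}
```

No thread is assigned a fixed share of URLs in this sketch; each one simply
takes whatever is available next, so which thread fetches which URL varies from
run to run.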
>>
>> Sebastian
>>
>>
>> On 02/14/2014 01:20 PM, Vangelis karv wrote:
>>> Thank you Markus for your fast response!
>>> 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
>>> the next cycle?
>>> 2) I understand that the whole set of urls to be fetched is saved at
>>> QueueFeeder. Each thread will be assigned a number of urls to fetch equal
>>> to: (wholeSetToBeFetched) / (numberOfThreads) ?
>>>
>>> Happy Valentine's Day!
>>>
>>>> Subject: RE: Threads
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Date: Fri, 14 Feb 2014 11:45:16 +0000
>>>>
>>>> Hi,
>>>>
>>>> They take records (FetchItems) from the QueueFeeder. Queues are based
>>>> on domain, host or IP, and a URL exists only once, so nothing collides. The
>>>> redirect will be followed in the next fetch cycle.
>>>>
>>>> Markus
>>>>
>>>>
>>>>
>>>> -----Original message-----
>>>>> From:Vangelis karv <[email protected]>
>>>>> Sent: Friday 14th February 2014 12:39
>>>>> To: [email protected]
>>>>> Subject: Threads
>>>>>
>>>>> Hello people!
>>>>>
>>>>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean,
>>>>> who tells them which pages each of them will fetch?
>>>>> Is it possible for some of them to collide or fetch the same page without
>>>>> knowing it?
>>>>> I read the code and found that if the redirect is to the same page, it
>>>>> will not follow that redirect. Any advice would be very helpful!
>>>>>
>>>>> Vangelis