My exact problem is the following: I want to write a scoring function that
increases a URL's score by 10 for every .jpg image the page contains. In the
distributeScoreToOutlinks method I added this:
    for (ScoreDatum free : scoreData) {
        try {
            String aleos = free.getUrl();
            if (aleos.contains(".jpg")) {
                adjust += 10.0f;
            }
        } catch (Exception e) {}
    }
    float aleks = row.getScore();
    row.setScore(aleks + adjust);
For example, http://www.uefa.com/ contains ~25 .jpg images and gets a score of
~251 with my scoring plugin. At depth 2 that score goes to 502, at depth 3 to
1004, etc.
I want that page's score to stay at 251, and I do not want the page to be
refetched and updated again. I think my problem is that, at the beginning of
each crawl cycle, Nutch updates http://www.uefa.com/ again, which is my seed
URL.
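
A minimal sketch of one way to keep the bonus from compounding, assuming the
growth comes from adding the bonus on top of an already boosted stored score
each time the page is processed. The class JpgScoreSketch, its methods, and
the base score of 1.0 are hypothetical and not part of the Nutch API; the idea
is only to derive the score from a fixed base rather than accumulate into the
stored value.

```java
import java.util.ArrayList;
import java.util.List;

public class JpgScoreSketch {
    // Bonus per .jpg outlink, as in the snippet above.
    static final float JPG_BONUS = 10.0f;

    // Counts outlink URLs containing ".jpg" (same test as the original code).
    static int countJpgOutlinks(List<String> outlinkUrls) {
        int count = 0;
        for (String url : outlinkUrls) {
            if (url != null && url.contains(".jpg")) {
                count++;
            }
        }
        return count;
    }

    // Computes the score from a fixed base (hypothetical), so calling it
    // again for the same page yields the same value instead of a larger one.
    static float scoreFor(float baseScore, List<String> outlinkUrls) {
        return baseScore + JPG_BONUS * countJpgOutlinks(outlinkUrls);
    }

    public static void main(String[] args) {
        List<String> outlinks = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            outlinks.add("http://www.example.com/img" + i + ".jpg");
        }
        outlinks.add("http://www.example.com/index.html");

        // Simulate three crawl cycles: the result does not grow.
        for (int cycle = 1; cycle <= 3; cycle++) {
            System.out.println("cycle " + cycle + ": " + scoreFor(1.0f, outlinks));
        }
    }
}
```

In a real plugin the base value would have to be kept somewhere that is not
overwritten by the boosted score; where that is best stored in Nutch 2.x is
left open here.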
Any ideas?
Thank you in advance!
From: [email protected]
To: [email protected]
Subject: RE: Threads
Date: Mon, 17 Feb 2014 11:28:43 +0200
Thank you Sebastian for your trouble!
I forgot to mention that I am using Nutch 2.2.1, and I can't find
http.redirect.max there. I guess it exists only in 1.x.
Any ideas on how to answer my 1st question? (I do not want the same page to be
refetched).
> Date: Sun, 16 Feb 2014 14:52:20 +0100
> From: [email protected]
> To: [email protected]
> Subject: Re: Threads
>
> Hi Vangelis,
>
> > 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
> > the next cycle?
> Yes, if http.redirect.max == 0 (which is the default).
>
> > 2) I understand that the whole set of urls to be fetched is saved at
> > QueueFeeder.
> QueueFeeder reads the generated list of URLs chunk by chunk and (re-)feeds
> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the
> total number of URLs is large, the list is never stored entirely in memory.
>
> > Each thread will be assigned a number of urls to fetch equal to:
> > (wholeSetToBeFetched) /
> > (numberOfThreads) ?
> After having fetched a URL, a FetcherThread asks for a new URL. If it does
> not get one because all queues are blocked for politeness, it sleeps a
> second and tries again. The exact number of URLs processed by a thread is
> random, but ideally the number should be approximately equal for each
> thread. Of course, there should not be many more threads than queues
> (hosts, domains, IPs), at least if fetcher.threads.per.queue == 1.
>
> Sebastian
>
>
> On 02/14/2014 01:20 PM, Vangelis karv wrote:
> > Thank you Markus for your fast response!
> > 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at
> > the next cycle?
> > 2) I understand that the whole set of urls to be fetched is saved at
> > QueueFeeder. Each thread will be assigned a number of urls to fetch equal
> > to: (wholeSetToBeFetched) / (numberOfThreads) ?
> >
> > Happy Valentine's Day!
> >
> >> Subject: RE: Threads
> >> From: [email protected]
> >> To: [email protected]
> >> Date: Fri, 14 Feb 2014 11:45:16 +0000
> >>
> >> Hi,
> >>
> >> They take records (FetchItems) from the QueueFeeder. Queues are based
> >> on domain, host or IP, and a URL exists only once, so nothing collides.
> >> The redirect will be followed in the next fetch cycle.
> >>
> >> Markus
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:Vangelis karv <[email protected]>
> >>> Sent: Friday 14th February 2014 12:39
> >>> To: [email protected]
> >>> Subject: Threads
> >>>
> >>> Hello people!
> >>>
> >>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean,
> >>> who tells each of them which pages to fetch?
> >>> Is it possible for some of them to collide or fetch the same page
> >>> without knowing it?
> >>> I read the code and found that if the redirect is to the same page, it
> >>> will not follow that redirect. Any advice would be very helpful!
> >>>
> >>> Vangelis
> >>
> >
> >
>
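
The thread cooperation described in the quoted replies (per-host queues, each
URL present only once, every thread asking for the next URL after finishing
one) can be illustrated with a small simulation. This is only a sketch under
simplifying assumptions: FetchQueueSketch and its methods are hypothetical and
are not Nutch's QueueFeeder or FetchItemQueues, politeness delays and chunked
feeding are left out, and "fetching" just records the URL.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class FetchQueueSketch {
    // One queue of pending URLs per host.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    // URLs already queued, so that each URL exists only once.
    private final Set<String> seen = new HashSet<>();

    synchronized void add(String url) {
        if (!seen.add(url)) {
            return; // duplicate, ignore
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            host = "unknown";
        }
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Hands out the next pending URL, or null when every queue is empty.
    // Synchronized, so no two threads can receive the same URL.
    synchronized String next() {
        for (Queue<String> queue : queues.values()) {
            String url = queue.poll();
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    // Runs the given number of worker threads until all queues are drained
    // and returns the URLs in the order they were "fetched".
    static List<String> fetchAll(List<String> urls, int threads) {
        FetchQueueSketch feeder = new FetchQueueSketch();
        for (String u : urls) {
            feeder.add(u);
        }
        List<String> fetched = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                String url;
                while ((url = feeder.next()) != null) {
                    fetched.add(url); // stand-in for the real fetch
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return fetched;
    }
}
```

How many URLs each thread ends up processing depends on scheduling, which
matches the point that the per-thread count is not fixed in advance.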