RE: Threads

Vangelis karv Mon, 17 Feb 2014 03:13:34 -0800

My exact problem is the following: I want to write a scoring function that 
adds 10 to a URL's score for every .jpg image the page contains. In the 
method distributeScoreToOutlinks I added this:


for (ScoreDatum free : scoreData) {
  try {
    String aleos = free.getUrl();
    if (aleos.contains(".jpg")) {
      adjust += 10.0f;
    }
  } catch (Exception e) {}
}

float aleks = row.getScore();
row.setScore(aleks + adjust);

For example, http://www.uefa.com/ contains ~25 .jpg images and gets a score of 
~251 with my scoring plugin. At depth 2 that score goes to 502, at depth 3 to 
1004, etc.
I want that page's score to stay at 251, and I do not want the page to be 
refetched and updated again. I think my problem is that, at the beginning of 
each cycle, Nutch updates http://www.uefa.com/ again, which is my prime URL.

Any ideas?
Thank you in advance!
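
One possible way to keep a per-image bonus from being added again on every cycle is to record that it has already been applied. The following is only a minimal sketch under stated assumptions: the Page class, the "imageBonusApplied" marker key, and the method names are hypothetical stand-ins, not Nutch 2.2.1 APIs, and whether Nutch really calls the scoring method once per cycle for the same page is assumed here, not confirmed by the thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImageBonusSketch {
    static final float BONUS_PER_IMAGE = 10.0f;
    // Hypothetical metadata key used to remember that the bonus was applied.
    static final String APPLIED_KEY = "imageBonusApplied";

    /** Hypothetical stand-in for a stored page row: a score plus metadata. */
    static class Page {
        float score;
        final Map<String, String> metadata = new HashMap<>();
        Page(float score) { this.score = score; }
    }

    /** Returns 10 for every outlink URL that contains ".jpg". */
    static float imageBonus(List<String> outlinkUrls) {
        float bonus = 0.0f;
        for (String url : outlinkUrls) {
            if (url != null && url.contains(".jpg")) {
                bonus += BONUS_PER_IMAGE;
            }
        }
        return bonus;
    }

    /** Adds the bonus the first time only; later calls leave the score alone. */
    static void applyBonusOnce(Page page, List<String> outlinkUrls) {
        if (page.metadata.containsKey(APPLIED_KEY)) {
            return;
        }
        page.score += imageBonus(outlinkUrls);
        page.metadata.put(APPLIED_KEY, "true");
    }

    public static void main(String[] args) {
        Page page = new Page(1.0f);
        List<String> outlinks = Arrays.asList(
            "http://example.com/a.jpg",
            "http://example.com/b.jpg",
            "http://example.com/index.html");
        applyBonusOnce(page, outlinks); // first cycle: bonus is added
        applyBonusOnce(page, outlinks); // second cycle: nothing changes
        System.out.println(page.score);
    }
}
```

In a real plugin the marker would have to be stored with the page itself so that it survives between cycles; how to do that in Nutch 2.x is outside what this sketch shows.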

From: [email protected]
To: [email protected]
Subject: RE: Threads
Date: Mon, 17 Feb 2014 11:28:43 +0200




Thank you Sebastian for your trouble!

I forgot to mention that I am using Nutch 2.2.1 and I can't find 
http.redirect.max. I guess that it exists only in 1.x.
Any ideas on how to answer my first question? (I do not want the same page to 
be refetched.)

> Date: Sun, 16 Feb 2014 14:52:20 +0100
> From: [email protected]
> To: [email protected]
> Subject: Re: Threads
> 
> Hi Vangelis,
> 
> > 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at 
> > the next cycle?
> Yes, if http.redirect.max == 0 (which is the default).
> 
> > 2) I understand that the whole set of urls to be fetched is saved at 
> > QueueFeeder.
> QueueFeeder reads the generated list of URLs chunk by chunk and (re-)feeds 
> FetchItemQueues, which is a map <host/domain/ip, FetchItemQueue>. If the 
> list of URLs is long, it is never stored entirely in memory.
> 
> > Each thread will be assigned a number of urls to fetch equal to: 
> > (wholeSetToBeFetched) / (numberOfThreads) ?
> After having fetched a URL, a FetcherThread asks for a new URL. If it does 
> not get one because all queues are blocked for politeness, it sleeps a 
> second and tries again. The exact number of URLs processed by a thread is 
> random, but ideally the number should be approximately equal for each 
> thread. Of course, there should not be many more threads than queues 
> (hosts, domains, IPs), at least if fetcher.threads.per.queue == 1.
> 
> Sebastian
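
The pull model described in the reply above can be illustrated with a small simulation. This is a sketch only, not Nutch's Fetcher code: HostQueues, run and pause are hypothetical names, the "fetch" is a short sleep, and politeness is reduced to "at most one thread per host at a time" (roughly the fetcher.threads.per.queue == 1 case).

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

public class FetchQueueSketch {
    /** One queue per host; a host is busy while a thread fetches from it. */
    static class HostQueues {
        private final Map<String, Queue<String>> queues = new HashMap<>();
        private final Map<String, Boolean> busy = new HashMap<>();
        private int remaining = 0; // URLs not yet completely processed

        synchronized void add(String url) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
            remaining++;
        }

        /** Returns a URL from a host that is not busy, or null if all are blocked. */
        synchronized String next() {
            for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
                if (!e.getValue().isEmpty() && !busy.getOrDefault(e.getKey(), false)) {
                    busy.put(e.getKey(), true);
                    return e.getValue().poll(); // each URL is handed out once
                }
            }
            return null;
        }

        synchronized void done(String url) {
            busy.put(URI.create(url).getHost(), false);
            remaining--;
        }

        synchronized boolean finished() { return remaining == 0; }
    }

    static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    /** Runs threadCount workers and returns how often each URL was fetched. */
    static Map<String, Integer> run(List<String> urls, int threadCount) {
        HostQueues queues = new HostQueues();
        urls.forEach(queues::add);
        Map<String, Integer> fetchCounts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                while (!queues.finished()) {
                    String url = queues.next();
                    if (url == null) { // every queue is blocked: wait, then retry
                        pause(10);
                        continue;
                    }
                    pause(5); // stand-in for the actual fetch
                    fetchCounts.merge(url, 1, Integer::sum);
                    queues.done(url);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return fetchCounts;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2", "http://b.example/1");
        System.out.println(run(urls, 4));
    }
}
```

With four threads and only two hosts, at most two threads can work at any moment and the others keep sleeping and retrying, which is why having many more threads than queues brings little benefit in this simplified model.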
> 
> 
> On 02/14/2014 01:20 PM, Vangelis karv wrote:
> > Thank you Marcus for your fast response! 
> > 1) If www.somesite.com redirects to www.somesite.com, will it fetch it at 
> > the next cycle?
> > 2) I understand that the whole set of urls to be fetched is saved at 
> > QueueFeeder. Each thread will be assigned a number of urls to fetch equal 
> > to: (wholeSetToBeFetched) / (numberOfThreads) ?
> > 
> > Happy Valentine's Day!
> > 
> >> Subject: RE: Threads
> >> From: [email protected]
> >> To: [email protected]
> >> Date: Fri, 14 Feb 2014 11:45:16 +0000
> >>
> >> Hi,
> >>
> >> They take records (FetchItems) from the QueueFeeder. Queues are based 
> >> on domain, host or IP, and a URL exists only once, so nothing collides. 
> >> The redirect will be followed in the next fetch cycle.
> >>
> >> Markus
> >>
> >>  
> >>  
> >> -----Original message-----
> >>> From:Vangelis karv <[email protected]>
> >>> Sent: Friday 14th February 2014 12:39
> >>> To: [email protected]
> >>> Subject: Threads
> >>>
> >>> Hello people!
> >>>
> >>> Let's say we choose 20 threads to fetch. How do they cooperate? I mean, 
> >>> who tells them which pages each one of them will fetch?
> >>> Is it possible for some of them to collide or fetch the same page 
> >>> without knowing? 
> >>> I read the code and found that if the redirect is to the same page, it 
> >>> will not follow that redirect. Any advice would be very helpful! 
> >>>
> >>> Vangelis
> >>
> >                                       
> > 
>
