Hi

There are two parameters intended for exactly this purpose
(generate.max.count and generate.count.mode). See nutch-default.xml
for their descriptions. They are available in both versions of Nutch and let
you set a maximum number of URLs from the same host/domain/IP in a
fetchlist.
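
As a rough sketch of the effect a per-host cap such as generate.max.count has
on a fetchlist, here is a toy Python model. It is not Nutch's generator code:
select_fetchlist, hosts_in, the "negative means unlimited" convention and the
scores are all invented for illustration, using the a.com/b.com/c.com numbers
from the question.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_fetchlist(scored_urls, top_n, max_per_host):
    """Pick up to top_n URLs by descending score, at most max_per_host per host.

    A negative max_per_host means "no per-host limit" (a convention assumed
    for this sketch only).
    """
    per_host = defaultdict(int)
    selected = []
    for url, score in sorted(scored_urls, key=lambda pair: pair[1], reverse=True):
        if len(selected) >= top_n:
            break
        host = urlparse(url).hostname
        if 0 <= max_per_host <= per_host[host]:
            continue  # this host has already reached its cap
        per_host[host] += 1
        selected.append(url)
    return selected


def hosts_in(urls):
    """Count how many of the selected URLs belong to each host."""
    counts = defaultdict(int)
    for url in urls:
        counts[urlparse(url).hostname] += 1
    return dict(counts)


# Made-up data: 100 pages per site, with a.com scoring highest.
pages = (
    [(f"http://a.com/{i}", 3.0) for i in range(100)]
    + [(f"http://b.com/{i}", 2.0) for i in range(100)]
    + [(f"http://c.com/{i}", 1.0) for i in range(100)]
)

uncapped = hosts_in(select_fetchlist(pages, top_n=30, max_per_host=-1))
capped = hosts_in(select_fetchlist(pages, top_n=30, max_per_host=10))
print(uncapped)  # all 30 slots go to a.com
print(capped)    # 10 slots each for a.com, b.com and c.com
```

In a real crawl the cap is set as a property override in conf/nutch-site.xml
rather than in code.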

Julien



On 11 May 2014 23:22, Talat Uyarer <[email protected]> wrote:

> Hi Diaa,
>
> Good question, but at the moment that is not possible. When you use the
> topN parameter, Nutch only looks at the list ordered by score. If you want
> to take the same number from each host, you can use a different webpage
> table. Or, if you are willing to develop this feature for Nutch, I can help you.
>
> Talat
>
> 2014-05-11 18:00 GMT+03:00 Diaa Abdallah <[email protected]>:
> > When crawling multiple websites, how would I make sure that Nutch generates
> > a fetch list that covers multiple hosts rather than just one host?
> >
> > For example:
> > Let's say we have 3 websites with the following discovered pages:
> > a.com 100
> > b.com 100
> > c.com 100
> >
> > When I generate with topN 30, I'd like to make sure that these 30 are
> > drawn proportionally from each site, so that it would take:
> > 10 from a.com
> > 10 from b.com
> > 10 from c.com
> >
> > Rather than take 30 from just a.com
> >
> > This happens when the webpages from a.com have a better score.
> > The harm is that if all the pages come from a.com, the crawl has
> > lower throughput: it takes longer to retrieve 30 pages from just a.com
> > than 10 from each site, since there is a delay each time the same
> > host is crawled.
> >
> > Regards,
> > Diaa
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
