Hi Talat,
I think this is worth doing, since it would improve crawl throughput
significantly.
I will see what I can do and will open a JIRA issue once I have something.
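
As a rough sketch of the selection I have in mind (plain Python, not Nutch
code): group the candidate URLs by host, sort each group by score, then take
one URL per host in turn until topN is reached. The function name
`select_balanced` and the `(url, score)` input format are made up for
illustration and do not correspond to anything in the Nutch code base.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_balanced(candidates, top_n):
    """Pick up to top_n URLs, taking turns between hosts.

    candidates: iterable of (url, score) pairs (hypothetical input format).
    """
    # Group candidates by host.
    by_host = defaultdict(list)
    for url, score in candidates:
        by_host[urlparse(url).hostname].append((score, url))

    # Within each host, best score first.
    for items in by_host.values():
        items.sort(key=lambda pair: pair[0], reverse=True)

    # Visit hosts in order of their best score, so the highest-scoring
    # host still goes first within each round.
    hosts = sorted(by_host, key=lambda h: by_host[h][0][0], reverse=True)

    selected = []
    rank = 0
    while len(selected) < top_n:
        took_any = False
        for host in hosts:
            if rank < len(by_host[host]):
                selected.append(by_host[host][rank][1])
                took_any = True
                if len(selected) == top_n:
                    return selected
        if not took_any:
            break  # every host is exhausted
        rank += 1
    return selected
```

With three hosts of 100 pages each and topN 30, this takes 10 URLs from each
host, even when every a.com page outscores every page on the other hosts. If
one host runs out of pages, the remaining slots go to the other hosts.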

Thanks,
Diaa


On Mon, May 12, 2014 at 12:22 AM, Talat Uyarer <[email protected]> wrote:

> Hi Diaa,
>
> Good question, but at the moment that is not possible. When you use the
> topN parameter, Nutch only looks at the list ordered by score. If you want
> to take the same number of pages from each host, you can use a different
> webpage table. Or, if you are willing to develop this feature for Nutch, I
> can help you.
>
> Talat
>
> 2014-05-11 18:00 GMT+03:00 Diaa Abdallah <[email protected]>:
> > When crawling multiple websites, how can I make sure that Nutch
> > generates a fetch list covering multiple hosts rather than a single host?
> >
> > For example:
> > Let's say we have 3 websites with the following discovered pages:
> > a.com 100
> > b.com 100
> > c.com 100
> >
> > When I generate with topN 30, I'd like these 30 to be split
> > proportionally across the hosts, so that it would take:
> > 10 from a.com
> > 10 from b.com
> > 10 from c.com
> >
> > Rather than taking all 30 from a.com.
> >
> > This happens when the pages from a.com have better scores.
> > The harm is that if the fetch list contains only a.com pages, the crawl
> > has lower throughput: fetching 30 pages from a.com alone takes longer
> > than fetching 10 from each host, because there is a delay between
> > successive requests to the same host.
> >
> > Regards,
> > Diaa
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>
