On 2010-10-28 19:58, Rob Hunter wrote:
> Andrzej, do you know if the results will be evenly distributed?  In
> that setup, will there be five depth 2 sites returned from each seed
> url, or could it end up being 100 from the first and none from the
> last 20?

There's no guarantee they will be evenly distributed - this depends on
the relative "importance" (score) of the pages, so a few high-scoring
hosts can take most of a fetchlist. However, you can limit the maximum
number of pages per host that you want to have in a fetchlist, which
helps to balance the fetching process.
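
To illustrate the effect of such a per-host limit, here is a minimal
Python sketch. It is not Nutch's Generator code; build_fetchlist, its
parameters and the example URLs are made up for illustration. In the
Nutch versions I'm aware of the limit itself is configured with
generate.max.per.host (older releases) or generate.max.count together
with generate.count.mode (newer ones), but please check
nutch-default.xml for your version.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_fetchlist(scored_urls, top_n, max_per_host=None):
    """Pick up to top_n URLs by descending score, optionally capping each host.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    per_host = defaultdict(int)
    fetchlist = []
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    for url, _score in ranked:
        if len(fetchlist) >= top_n:
            break
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue  # this host already has its share of the fetchlist
        fetchlist.append(url)
        per_host[host] += 1
    return fetchlist


# Made-up candidates: one host holds all the high-scoring pages.
candidates = [
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://a.example/3", 0.7),
    ("http://b.example/1", 0.2),
    ("http://c.example/1", 0.1),
]

uncapped = build_fetchlist(candidates, top_n=3)
capped = build_fetchlist(candidates, top_n=3, max_per_host=1)
```

Without the limit all three slots go to a.example; with a limit of one
page per host the same three slots are spread over a.example, b.example
and c.example.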

> 
> There's also a property called db.max.outlinks.per.page - can that be
> used to limit the number of depth 2 sites fetched?

Yes, though that's not the primary purpose of this property - it's
rather to limit resource consumption during the updatedb operation, the
size of the linkdb, etc. But the secondary purpose is indeed to limit
the undue impact of pages with many outlinks, so that they don't
overwhelm the crawldb with their links.
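
As a rough sketch of that secondary effect (again in Python, not the
actual Nutch code; cap_outlinks and the page data are hypothetical), a
per-page limit keeps a hub page from contributing more links than the
limit. The sketch assumes the semantics I recall from
nutch-default.xml: a non-negative value is the maximum number of
outlinks processed per page, and a negative value means no limit.

```python
def cap_outlinks(outlinks, max_outlinks=100):
    """Keep at most max_outlinks links from one page.

    Assumed semantics: a negative max_outlinks means "no limit".
    Hypothetical helper for illustration only; not part of Nutch.
    """
    if max_outlinks < 0:
        return list(outlinks)
    return list(outlinks[:max_outlinks])


# Made-up pages: one hub with 500 outlinks, one ordinary page with 10.
hub_links = [f"http://hub.example/out/{i}" for i in range(500)]
page_links = [f"http://page.example/out/{i}" for i in range(10)]

hub_kept = cap_outlinks(hub_links, max_outlinks=100)
page_kept = cap_outlinks(page_links, max_outlinks=100)
hub_unlimited = cap_outlinks(hub_links, max_outlinks=-1)
```

Here the hub page adds 100 links instead of 500, while the ordinary
page is unaffected. Note that this only limits how many links each page
contributes; it does not by itself decide which of them get fetched at
depth 2.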


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
