On 2010-10-28 19:58, Rob Hunter wrote:
> Andrzej, do you know if the results will be evenly distributed? In
> that setup, will there be five depth 2 sites returned from each seed
> url, or could it end up being 100 from the first and none from the
> last 20?

There's no guarantee that they will be evenly distributed - this depends
on the relative "importance" (score) of the pages. However, you can limit
the maximum number of pages per host that you want to have in a fetchlist,
which helps to balance the fetching process.

> There's also a property called db.max.outlinks.per.page - can that be
> used to limit the number of depth 2 sites fetched?

Yes, though that's not the primary purpose of this property - it's rather
to limit resource consumption during the updatedb operation, the size of
the linkdb, etc. But the secondary purpose is indeed to limit the undue
impact of pages with many outlinks, so that they don't overwhelm the
crawldb with their links.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
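
To illustrate the two limits discussed above, here is a minimal Python sketch. It is not Nutch's Generator or updatedb code; the function names (build_fetchlist, cap_outlinks), the parameters (top_n, max_per_host, max_outlinks) and the example.com URLs are all hypothetical and exist only for this illustration. It assumes a fetchlist is chosen by taking the highest-scoring URLs first, which is why one host can dominate unless a per-host cap is applied.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_fetchlist(scored_urls, top_n, max_per_host=None):
    """Pick up to top_n URLs by descending score, optionally capping each host.

    scored_urls is a list of (url, score) pairs. This is an illustrative
    model only, not the selection logic Nutch actually implements.
    """
    per_host = defaultdict(int)
    selected = []
    ranked = sorted(scored_urls, key=lambda item: item[1], reverse=True)
    for url, _score in ranked:
        if len(selected) >= top_n:
            break
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue  # this host has used up its quota; try the next URL
        per_host[host] += 1
        selected.append(url)
    return selected


def cap_outlinks(outlinks, max_outlinks):
    """Keep only the first max_outlinks links found on a page."""
    return outlinks[:max_outlinks]


# Hypothetical candidates: one high-scoring host and one low-scoring host.
candidates = [
    ("http://a.example.com/1", 0.9),
    ("http://a.example.com/2", 0.8),
    ("http://a.example.com/3", 0.7),
    ("http://b.example.com/1", 0.1),
]

unbalanced = build_fetchlist(candidates, top_n=3)
balanced = build_fetchlist(candidates, top_n=3, max_per_host=2)
```

Without the cap, all three slots go to a.example.com because its pages score highest. With max_per_host=2, the third slot falls through to b.example.com. In the same spirit, cap_outlinks shows how truncating the links taken from each page bounds how many new URLs a single link-heavy page can contribute.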

