It sounds like you had db.max.outlinks.per.page set to 200 before you changed
it; in my nutch-default.xml it defaults to 100.

Anyway, because the directory index (which Nutch sees as a page like any
other) had over 200 links, only the first 200 were queued for fetching.
Since nothing else you were crawling linked to the other 2700 documents,
Nutch never picked them up.

I had a similar problem where Nutch wouldn't pick up all the pages from an
index, but because those pages had "next" and "previous" links, it would
grab a couple of pages per depth and take many depths to finish.
Changing db.max.outlinks.per.page to -1 fixed that problem as well.
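
For anyone else hitting this, the override goes in conf/nutch-site.xml (a minimal sketch; db.max.outlinks.per.page is the real property, but check the description text in your own version's nutch-default.xml):

```xml
<!-- conf/nutch-site.xml: overrides values from nutch-default.xml -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 = process all outlinks on a page; the default caps this
       (100 in my nutch-default.xml), silently dropping the rest -->
  <value>-1</value>
</property>
```

Note that raising the cap only affects pages parsed after the change, so you may need to re-crawl for it to take effect.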

-Mark


On Fri, Apr 15, 2011 at 2:54 PM, Melanie Drake <[email protected]> wrote:

> UPDATE: changing the value of db.max.outlinks.per.page to -1 seemed to fix
> this issue.
>
> Based on the description of this setting ("The maximum number of outlinks
> that we'll process for a page. ") I'm not 100% sure why this increases the
> number of files that are indexed.  I'm curious to know, but at least it's
> working now.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexed-Files-Limited-to-200-tp2825662p2825780.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
