Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopTutorial" page has been changed by ilgiz.
The comment on this change is: max outlinks per page.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=14&rev2=15

--------------------------------------------------

   * This tutorial worked well for me, however, I ran into a problem where my crawl wasn't working. Turned out, it was because I needed to set the user agent and other properties for the crawl. If anyone is reading this, and running into the same problem, look at the updated tutorial http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
+ ----
+ 
+  * By default Nutch will read only the first 100 links on a page. This will result in incomplete indexes when scanning file trees. So I set the "max outlinks per page" option to -1 in nutch-site.xml and got complete indexes.
+ {{{
+ <property>
+ <name>db.max.outlinks.per.page</name>
+ <value>-1</value>
+ <description>The maximum number of outlinks that we'll process for a page.
+ If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
+ will be processed for a page; otherwise, all outlinks will be processed.
+ </description>
+ </property>
+ }}}
+ 
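
The rule stated in the property description above (a nonnegative value caps the number of outlinks, a negative value keeps all of them) can be sketched in a few lines of Python. This is only an illustration of the described semantics, not Nutch's actual code: the helper names `read_property` and `limit_outlinks`, the sample `file:///` URLs, and the surrounding `<configuration>` element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical site configuration containing the property from the change above.
# The <configuration> wrapper is assumed here for the sake of a parseable document.
SITE_XML = """<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>"""


def read_property(xml_text, name, default=None):
    """Return the <value> text of the named <property>, or default if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default


def limit_outlinks(outlinks, max_outlinks):
    """Apply the documented rule: cap at max_outlinks if it is >= 0, else keep all."""
    if max_outlinks >= 0:
        return list(outlinks[:max_outlinks])
    return list(outlinks)


# Fall back to 100, the default mentioned in the note above, if the property is missing.
max_outlinks = int(read_property(SITE_XML, "db.max.outlinks.per.page", "100"))

# A made-up directory listing with more links than the default limit.
links = [f"file:///data/doc{i}.txt" for i in range(250)]
kept = limit_outlinks(links, max_outlinks)
```

With the value set to -1 every link in the listing is kept, whereas the default of 100 would drop everything after the first 100 entries, which matches the incomplete indexes reported for large file trees.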