[Nutch Wiki] Update of "NutchHadoopTutorial" by ilgiz

Apache Wiki Wed, 18 Nov 2009 09:24:07 -0800

The "NutchHadoopTutorial" page has been changed by ilgiz.
The comment on this change is: max outlinks per page.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=14&rev2=15

--------------------------------------------------

    
    * This tutorial worked well for me, but at first my crawl did not work. 
It turned out that I needed to set the user agent and other properties for 
the crawl.  If you run into the same problem, see the updated tutorial: 
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
  
+ ----
+ 
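+ * By default Nutch processes only the first 100 outlinks on a page.  This 
produces incomplete indexes when scanning file trees, where a single directory 
listing can link to far more than 100 files.  Setting the "max outlinks per 
page" option (db.max.outlinks.per.page) to -1 in conf/nutch-site.xml removed 
the limit and gave me complete indexes.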
+ {{{
+ <property>
+   <name>db.max.outlinks.per.page</name>
+   <value>-1</value>
+   <description>The maximum number of outlinks that we'll process for a page.
+   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
+   will be processed for a page; otherwise, all outlinks will be processed.
+   </description>
+ </property>
+ }}}
+
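
To make the effect of this setting concrete, here is a minimal Python sketch of the rule stated in the property description: a nonnegative value caps the number of outlinks, and a negative value keeps all of them. It is only an illustration, not Nutch's own code. The helper names `read_property` and `apply_outlink_limit`, the sample configuration string, and the `file:///data/...` URLs are all hypothetical and are not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of conf/nutch-site.xml.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
"""


def read_property(xml_text, name, default=None):
    """Return the <value> of the named <property>, or default if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return default


def apply_outlink_limit(outlinks, max_outlinks):
    """Apply the rule from the property description (illustrative only).

    Nonnegative limit: keep at most that many outlinks.
    Negative limit: keep every outlink.
    """
    if max_outlinks >= 0:
        return list(outlinks[:max_outlinks])
    return list(outlinks)


# Assumption: fall back to 100, the default mentioned in the note above.
limit = int(read_property(SAMPLE_SITE_XML, "db.max.outlinks.per.page", "100"))

# A made-up directory listing with 250 files.
links = [f"file:///data/doc{i}.txt" for i in range(250)]
kept = apply_outlink_limit(links, limit)
print(len(kept), "of", len(links), "outlinks kept")
```

With the limit left at 100, the same listing would be cut to its first 100 entries, which is the incomplete-index symptom described above.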
