In my situation, I find that linkdb merge takes much more time than fetch and 
parse combined, even though fetch is fully polite.

What is the standard advice for making linkdb-merge go faster?

I call invertlinks like this:
__bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT

invertlinks seems to call mergelinkdb automatically.

I currently have about 3-6 slaves for fetching, though that will increase soon.
I am using small segments (3000 URLs each), but I can increase that if
it would help.

I have the following properties that may be relevant.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>


The following properties are left at their defaults from nutch-default.xml:

<property>
  <name>db.update.max.inlinks</name>
  <value>10000</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>
