Hi Michael,

what is the size of your linkdb? If it is large (significantly larger than
the segment), the reason is easily explained: the linkdb has to be rewritten
on every invertlinks step. That is an expensive operation, and it becomes
more expensive as the crawl grows.

Unless you really need the linkdb to add anchor texts to your index, you
could:
- either limit the linkdb size by excluding internal links
  (db.ignore.internal.links = true)
- or update it less frequently (multiple segments in one turn)

A segment size of 3000 URLs seems small for a distributed crawl with a large
number of different hosts or domains.

You may observe similar problems when updating the CrawlDb, although only
later in the crawl, because the CrawlDb is usually smaller than the linkdb,
especially if the linkdb also includes internal links.
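
A minimal sketch of the "multiple segments in one turn" option, assuming
the usual invertlinks usage in which several segment paths may follow the
linkdb argument. NUTCH, invert_batch and the segment names are hypothetical
and only serve the illustration; adapt them to your own crawl script.

```shell
# Assumed locations; override them from the environment as needed.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_PATH="${CRAWL_PATH:-crawl}"

# invert_batch SEGMENT...  (hypothetical helper)
# Runs invertlinks once for all given segments, so the linkdb is
# rewritten once per batch instead of once per segment.
invert_batch() {
  # Replace each segment name by its full path under $CRAWL_PATH/segments:
  # append the path, then drop the original name from the front.
  for seg in "$@"; do
    set -- "$@" "$CRAWL_PATH/segments/$seg"
    shift
  done
  "$NUTCH" invertlinks "$CRAWL_PATH/linkdb" "$@"
}
```

You would then fetch and parse a few segments first and call, for example,
invert_batch with the names of those segments, rather than calling
invertlinks after every single segment.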

Best,
Sebastian

On 04/03/2017 02:08 AM, Michael Coffey wrote:
> In my situation, I find that linkdb merge takes much more time than fetch and
> parse combined, even though fetch is fully polite.
>
> What is the standard advice for making linkdb-merge go faster?
>
> I call invertlinks like this:
> __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>
> invertlinks seems to call mergelinkdb automatically.
>
> I currently have about 3-6 slaves for fetching, though that will increase
> soon. I am currently using small segment sizes (3000 urls) but I can increase
> that if it would help.
>
> I have the following properties that may be relevant.
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>1000</value>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
> </property>
>
>
> The following props are left as default in nutch-default.xml
>
> <property>
>   <name>db.update.max.inlinks</name>
>   <value>10000</value>
> </property>
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
> </property>
>

