Hello Michael - see inline.

Markus

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Tuesday 4th April 2017 21:32
> To: [email protected]
> Subject: Re: Speed of linkDB
>
> Thank you, Sebastian, that sounds like a great suggestion! You're right that
> 3000 is a small segment size. I am using 3000 per slave just in this
> still-early testing phase. I don't know the actual size of my linkdb, but my
> crawldb has over 48 million urls so far, of which over 1.5 million have been
> fetched.
If I remember correctly, LinkDB is filtering and normalizing by default.
Disable it via noFilter and noNormalize to speed it all up quite a bit. Also,
enable map file compression, it greatly reduces IO. And, as Sebastian
mentioned, do not run LinkDB on every segment, but once a day or so on all
segments fetched that day. (Example invocation and properties at the bottom
of this mail.)

> I think I need the linkdb because incoming anchors are important for
> search-engine relevance, right?

In theory, yes. But a great many other things are probably much more
important, such as text extraction and analysis. If you reduce the number of
inlinks per record to a few, you probably already have all the linking
keywords.

> ________________________________
> From: Sebastian Nagel <[email protected]>
>
> Hi Michael,
>
> what is the size of your linkdb? If it's large (significantly larger than
> the segment) the reason is easily explained: the linkdb needs to be
> rewritten on every invertlinks step. That's an expensive action becoming
> more expensive for larger crawls. Unless you really need the linkdb to add
> anchor texts to your index you could:
> - either limit the linkdb size by excluding internal links
> - or update it less frequently (multiple segments in one turn)
>
> A segment size of 3000 URLs seems small for a distributed crawl with a
> large number of different hosts or domains. You may observe similar
> problems updating the CrawlDb, although later because the CrawlDb is
> usually smaller, esp. if the linkdb includes also internal links.
>
> Best,
> Sebastian
>
> On 04/03/2017 02:08 AM, Michael Coffey wrote:
> > In my situation, I find that linkdb merge takes much more time than fetch
> > and parse combined, even though fetch is fully polite.
> >
> > What is the standard advice for making linkdb-merge go faster?
> >
> > I call invertlinks like this:
> > __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
> >
> > invertlinks seems to call mergelinkdb automatically.
> >
> > I currently have about 3-6 slaves for fetching, though that will increase
> > soon. I am currently using small segment sizes (3000 urls) but I can
> > increase that if it would help.
> >
> > I have the following properties that may be relevant.
> >
> > <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>1000</value>
> > </property>
> >
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>false</value>
> > </property>
> >
> > The following props are left as default in nutch-default.xml:
> >
> > <property>
> >   <name>db.update.max.inlinks</name>
> >   <value>10000</value>
> > </property>
> >
> > <property>
> >   <name>db.ignore.internal.links</name>
> >   <value>false</value>
> > </property>
> >
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>false</value>
> > </property>
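
P.S. To make the noFilter/noNormalize and batching part concrete, an
invocation along these lines should do it (untested here; the paths follow
the $CRAWL_PATH layout from your script):

  bin/nutch invertlinks "$CRAWL_PATH"/linkdb -dir "$CRAWL_PATH"/segments -noNormalize -noFilter

With -dir, LinkDb picks up every segment under the given directory in a
single run, so point it at the segments fetched that day, or list the
individual segment directories instead of -dir. The -noNormalize and
-noFilter switches skip URL normalization and filtering during link
inversion, which is a large part of what makes the step expensive.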

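For the map file compression, the usual route is the Hadoop-level job output
compression rather than a Nutch-specific switch. Something along these lines
in nutch-site.xml (the property names below are the Hadoop 2 ones, check them
against your Hadoop version; older setups use the mapred.output.* names),
plus Sebastian's suggestion of dropping internal links:

  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
  </property>

  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>

  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>

Dropping internal links keeps the linkdb much smaller on sites with heavy
internal navigation, at the price of losing internal anchor texts.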
