Hello Michael - see inline.
Markus
 
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Tuesday 4th April 2017 21:32
> To: [email protected]
> Subject: Re: Speed of linkDB
> 
> Thank you, Sebastian, that sounds like a great suggestion! You're right that 
> 3000 is a small segment size. I am using 3000 per slave just in this 
> still-early testing phase. I don't know the actual size of my linkdb, but my 
> crawldb has over 48 million urls so far, of which over 1.5 million have been 
> fetched.

If I remember correctly, LinkDb filters and normalizes URLs by default. 
Disable that via the -noFilter and -noNormalize options to speed it all up 
quite a bit. Also, enable map file compression; it greatly reduces IO. And, 
as Sebastian mentioned, do not run LinkDb on every segment, but once a day 
or so on all segments fetched that day.
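
A once-a-day update over all of that day's segments, with filtering and 
normalization disabled, would look roughly like this (a sketch only; it 
assumes you collect that day's segments in a directory of their own, since 
-dir picks up every segment it finds):

  bin/nutch invertlinks "$CRAWL_PATH"/linkdb -dir "$CRAWL_PATH"/segments_today -noNormalize -noFilter

Map file compression is a Hadoop-level setting; something like the following 
in nutch-site.xml should do it (property names assume Hadoop 2.x, and the 
Snappy codec assumes its native libraries are installed):

<property>
 <name>mapreduce.output.fileoutputformat.compress</name>
 <value>true</value>
</property>

<property>
 <name>mapreduce.output.fileoutputformat.compress.type</name>
 <value>BLOCK</value>
</property>

<property>
 <name>mapreduce.output.fileoutputformat.compress.codec</name>
 <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>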

> 
> 
> I think I need the linkdb because incoming anchors are important for 
> search-engine relevance, right?

In theory, yes. But a great many other things are probably much more important, 
such as text extraction and analysis. If you reduce the number of inlinks kept 
per record to a few, you probably already have all the linking keywords.
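
If I recall correctly, the cap is the db.max.inlinks property (read by the 
LinkDb merger, default 10000), so an override along these lines in 
nutch-site.xml should do it (the value 100 is just an illustration):

<property>
 <name>db.max.inlinks</name>
 <value>100</value>
</property>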


> 
> 
> ________________________________
> From: Sebastian Nagel <[email protected]>
> 
> 
> Hi Michael, what is the size of your linkdb? If it's large (significantly 
> larger than the segment), the reason is easily explained: the linkdb needs 
> to be rewritten on every invertlinks step. That's an expensive operation 
> that becomes more expensive as the crawl grows. Unless you really need the 
> linkdb to add anchor texts to your index you could:
> - either limit the linkdb size by excluding internal links
> - or update it less frequently (multiple segments in one turn)
> A segment size of 3000 URLs seems small for a distributed crawl with a 
> large number of different hosts or domains. You may observe similar 
> problems updating the CrawlDb, although later, because the CrawlDb is 
> usually smaller, esp. if the linkdb also includes internal links.
> Best,
> Sebastian
> 
> On 04/03/2017 02:08 AM, Michael Coffey wrote:
> > In my situation, I find that linkdb merge takes much more time than fetch 
> > and parse combined, even though fetch is fully polite.
> > 
> > What is the standard advice for making linkdb-merge go faster?
> > 
> > I call invertlinks like this:
> > __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
> > 
> > invertlinks seems to call mergelinkdb automatically.
> > 
> > I currently have about 3-6 slaves for fetching, though that will increase 
> > soon. I am currently using small segment sizes (3000 urls) but I can 
> > increase that if it would help.
> > 
> > I have the following properties that may be relevant.
> > 
> > <property>
> >  <name>db.max.outlinks.per.page</name>
> >  <value>1000</value>
> > </property>
> > 
> > <property>
> >  <name>db.ignore.external.links</name>
> >  <value>false</value>
> > </property>
> > 
> > 
> > The following props are left as default in nutch-default.xml
> > 
> > <property>
> >  <name>db.update.max.inlinks</name>
> >  <value>10000</value>
> > </property>
> > 
> > <property>
> >  <name>db.ignore.internal.links</name>
> >  <value>false</value>
> > </property>
> > 
> > <property>
> >  <name>db.ignore.external.links</name>
> >  <value>false</value>
> > </property>
> > 
> 
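
For reference, Sebastian's first option (excluding internal links) maps to 
the db.ignore.internal.links property quoted above; flipping it to true in 
nutch-site.xml keeps same-host links out of the linkdb, at the cost of 
losing internal anchor texts:

<property>
 <name>db.ignore.internal.links</name>
 <value>true</value>
</property>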
