Re: Speed of linkDB

Michael Coffey Fri, 12 May 2017 12:10:02 -0700

I am curious about the noFilter and noNormalize options for linkdb, suggested 
by Marcus. What do the default normalize and filtering operations do, and what 
would I be losing by turning them off?
Still looking to speed up the process. Now using topN 96000 and doing linkdb on 
multiple segments per job. Surprised to see that the linkdb-merge job seems to 
be CPU-bound, according to sysstat.
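
For reference, a minimal sketch of the kind of invocation discussed here: one invertlinks run over every segment in a directory, with normalization and filtering of link URLs switched off. The -dir, -noNormalize and -noFilter options are the ones listed in the LinkDb usage message; the paths, the NUTCH_BIN and DRY_RUN variables, and the run helper are assumptions for illustration only, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own installation and crawl directory.
NUTCH_BIN=bin/nutch
CRAWL_PATH=crawl

# DRY_RUN=1 only prints the command; set it to 0 to actually execute it.
DRY_RUN=1

# Helper (illustrative): print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# One linkdb update for all segments under $CRAWL_PATH/segments,
# skipping URL normalization and URL filtering of the links.
run "$NUTCH_BIN" invertlinks "$CRAWL_PATH/linkdb" \
  -dir "$CRAWL_PATH/segments" -noNormalize -noFilter
```

With both options set, link URLs go into the linkdb as they were recorded at parse time, so whether this is safe probably depends on whether the URL filters or normalizers have changed since the segments were parsed.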



From: Michael Coffey <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, April 4, 2017 12:25 PM
Subject: Re: Speed of linkDB

Thank you, Sebastian, that sounds like a great suggestion! You're right that 
3000 is a small segment size. I am using 3000 per slave just in this 
still-early testing phase. I don't know the actual size of my linkdb, but my 
crawldb has over 48 million urls so far, of which over 1.5 million have been 
fetched.


I think I need the linkdb because incoming anchors are important for 
search-engine relevance, right?


________________________________
From: Sebastian Nagel <[email protected]>


Hi Michael,

what is the size of your linkdb? If it's large (significantly larger than the
segment) the reason is easily explained: the linkdb needs to be rewritten on
every invertlinks step. That's an expensive action becoming more expensive for
larger crawls.

Unless you really need the linkdb to add anchor texts to your index you could:
- either limit the linkdb size by excluding internal links
- or update it less frequently (multiple segments in one turn)

A segment size of 3000 URLs seems small for a distributed crawl with a large
number of different hosts or domains. You may observe similar problems updating
the CrawlDb, although later because the CrawlDb is usually smaller, esp. if the
linkdb includes also internal links.

Best,
Sebastian

On 04/03/2017 02:08 AM, Michael Coffey wrote:
> In my situation, I find that linkdb merge takes much more time than fetch and 
> parse combined, even though fetch is fully polite.
> 
> What is the standard advice for making linkdb-merge go faster?
> 
> I call invertlinks like this:
> __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
> 
> invertlinks seems to call mergelinkdb automatically.
> 
> I currently have about 3-6 slaves for fetching, though that will increase 
> soon. I am currently using small segment sizes (3000 urls) but I can increase 
> that if it would help.
> 
> I have the following properties that may be relevant.
> 
> <property>
>  <name>db.max.outlinks.per.page</name>
>  <value>1000</value>
> </property>
> 
> <property>
>  <name>db.ignore.external.links</name>
>  <value>false</value>
> </property>
> 
> 
> The following props are left as default in nutch-default.xml
> 
> <property>
>  <name>db.update.max.inlinks</name>
>  <value>10000</value>
> </property>
> 
> <property>
>  <name>db.ignore.internal.links</name>
>  <value>false</value>
> </property>
> 
> <property>
>  <name>db.ignore.external.links</name>
>  <value>false</value>
> </property>
>
